Unicode Support


EuanM
I'm currently groping my way to seeing how feature-complete our
Unicode support is.  I am doing this to establish what still needs to
be done to provide full Unicode support.

This seems to me to be an area where it would be best to write it
once, and then have the same codebase incorporated into the Smalltalks
that most share a common ancestry.

I am keen to get: equality-testing for strings; sortability for
strings which have ligatures and diacritic characters; and correct
round-tripping of data.

Call to action:
==========

If you have comments on these proposals - such as "but we already have
that facility" or "the reason we do not have these facilities is
because they are dog-slow" - please let me know them.

If you would like to help out, please let me know.

If you have Unicode experience and expertise, and would like to be, or
would be willing to be, in the  'council of experts' for this project,
please let me know.

If you have comments or ideas on anything mentioned in this email, please let me know.

In the first instance, the initiative's website will be:
http://smalltalk.uk.to/unicode.html

I have created a SqueakSource.com project called UnicodeSupport.

I want to avoid re-inventing any facilities which already exist,
except where they prevent us from reaching the goals of:
  - sortable UTF8 strings
  - sortable UTF16 strings
  - equivalence testing of 2 UTF8 strings
  - equivalence testing of 2 UTF16 strings
  - round-tripping UTF8 strings through Smalltalk
  - round-tripping UTF16 strings through Smalltalk.
As I understand it, we have limited Unicode support at the moment.

Current state of play
===============
ByteString is automagically converted to WideString when the need is detected.
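
For example (a workspace sketch; 16r263A is just an arbitrary character outside Latin-1):

| s |
s := 'abc' , (WideString with: (Character value: 16r263A)).
s class.   "=> WideString: concatenating in a wide character widens the result"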

Is there anything else that currently exists?

Definition of Terms
==============
A quick definition of terms before I go any further:

Standard terms from the Unicode standard
===============================
a compatibility character : an additional encoding of a *normal*
character, provided for compatibility and round-trip conversion
purposes.  For instance, a single-codepoint encoding of a Latin
character with a diacritic.

Made-up terms
============
a convenience codepoint :  a single codepoint which represents an item
that is also encoded as a string of codepoints.

(I tend to use the terms compatibility character and compatibility
codepoint interchangeably.  The standard only refers to them as
compatibility characters.  However, the standard is determined to
emphasise that characters are abstract and that codepoints are
concrete.  So I think it is often more useful and productive to think
in terms of compatibility or convenience codepoints.)

a composed character :  a character made up of several codepoints

Unicode encoding explained
=====================
A convenience codepoint can therefore be thought of as a code point
used for a character which also has a composed form.
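
For example, é exists both as the single codepoint U+00E9 (a
convenience codepoint, in the terms above) and as the composed form
U+0065 U+0301; naive codepoint-by-codepoint equality does not see the
two as equal:

| single composed |
single := WideString with: (Character value: 16rE9).               "é as one codepoint"
composed := WideString with: $e with: (Character value: 16r301).   "e + combining acute accent"
single = composed.   "=> false, although both render as é"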

The way Unicode works is that sometimes you can encode a character in
one byte, sometimes not.  Sometimes you can encode it in two bytes,
sometimes not.

You can therefore have a long stream of ASCII which is single-byte
Unicode.  If there is an occasional Cyrillic or Greek character in the
stream, it would be represented either by a compatibility character or
by a multi-byte combination.
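
For example, assuming the Zinc utf8Encoded extension that current
images carry:

'e' utf8Encoded.   "=> #[101]: one byte for ASCII"
'é' utf8Encoded.   "=> #[195 169]: two bytes for U+00E9"
'€' utf8Encoded.   "=> #[226 130 172]: three bytes for U+20AC"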

Using compatibility characters can prevent proper sorting and
equivalence testing.

Using "pure" Unicode, ie. "normal encodings", can cause compatibility
and round-tripping probelms.  Although avoiding them can *also* cause
compatibility issues and round-tripping problems.

Currently my thinking is:

a Utf8String class:
  an ordered collection, with 1-byte characters as the modal element,
  but with short arrays of wider characters where necessary

a Utf16String class:
  an ordered collection, with 2-byte characters as the modal element,
  but with short arrays of wider characters where necessary,
  beginning with a 2-byte endianness indicator.
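
As a very rough sketch of the shape such a class might take (every
name here is an assumption, not existing code):

Object subclass: #Utf8String
	instanceVariableNames: 'segments'
	classVariableNames: ''
	package: 'UnicodeSupport'

"segments would hold mostly ByteArray runs of 1-byte characters,
interleaved with short wide-character runs where needed"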

Utf8Strings sometimes need to be sortable, and sometimes need to be compatible.

So my thinking is that Utf8String will contain convenience codepoints
for round-tripping, and that where there are multiple convenience
codepoints for a character, it will standardise on one.

And that there is a Utf8SortableString which uses *only* normal characters.

We then need methods to convert between the two.

aUtf8String asUtf8SortableString

and

aUtf8SortableString asUtf8String
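
A minimal sketch of the first conversion, assuming hypothetical
codePoints / withAllCodePoints: accessors and a composed-form lookup
(none of which exist yet):

asUtf8SortableString
	"Answer a Utf8SortableString in which every convenience codepoint
	has been expanded into its normal, composed-form sequence."
	^ Utf8SortableString withAllCodePoints:
		(self codePoints gather: [ :cp | UnicodeTable composedFormOf: cp ])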


Sort orders are culture- and context-dependent: Sweden and Germany
have different sort orders for the same diacritic-bearing characters.
Some countries have one order in general usage, and another for
specific usages, such as phone directories (e.g. the UK and France).

Similarly for Utf16: a Utf16String and a Utf16SortableString, with
conversion methods.

A list of sorted words would be a SortedCollection, and there could be
pre-prepared sortBlocks for them, e.g. frPhoneBookOrder, deOrder,
seOrder, ukOrder, etc.

along the lines of
aListOfWords := SortedCollection sortBlock: deOrder
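
For instance (collationKeyFor: is a hypothetical selector standing in
for real German-collation logic):

deOrder := [ :a :b | (a collationKeyFor: #de) <= (b collationKeyFor: #de) ].
aListOfWords := SortedCollection sortBlock: deOrder.
aListOfWords addAll: #('Müller' 'Mueller' 'Muster').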

If a word is either a Utf8SortableString, or a well-formed Utf8String,
then we can perform equivalence testing on them trivially.

To make sure a Utf8String is well formed, we would need to have a way
of cleaning up any convenience codepoints which were valid, but which
were for a character which has multiple equally-valid alternative
convenience codepoints, and for which the string currently had the
"wrong" convenience codepoint.  (I.e. for any character with valid
alternative convenience codepoints, we would choose one to be in the
well-formed Utf8String, and we would need a method for cleaning the
alternative convenience codepoints out of the string and replacing
them with the chosen, approved convenience codepoint.)

aUtf8String cleanUtf8String
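
A minimal sketch of that cleaning method, under the same assumptions
as the conversion sketch above (ApprovedCodePoints is a hypothetical
mapping from each alternative codepoint to the chosen one):

cleanUtf8String
	"Answer a copy of the receiver in which every alternative
	convenience codepoint has been replaced by the approved one."
	^ self class withAllCodePoints:
		(self codePoints collect: [ :cp |
			ApprovedCodePoints at: cp ifAbsent: [ cp ] ])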

With WideString, a lot of the issues disappear - except
round-tripping.  (Although I'm sure I have seen something recently
about 4-byte strings that also have an additional bit, which would
make some Unicode characters 5 bytes long.)


(I'm starting to zone out now - if I've overlooked anything - obvious,
subtle, or somewhere in between, please let me know)

Cheers,
    Euan


Re: Unicode Support

Max Leske
Hi Euan

I think it’s great that you’re trying this. I hope you know what you’re getting yourself into :)


I’m no Unicode expert but I want to add two points to your list (although you’ve probably already thought of them):
- Normalisation and conversion (http://unicode.org/faq/normalization.html).
        Unicode / ICU provide libraries (libuconv / libiconv) that handle this stuff. Specifically, normalisation
        conversions aren’t trivial, and I think it wouldn’t make much sense to reimplement those algorithms. I do
        think, however, that having them available is important. (Where I work we’re currently writing a VM plugin
        for access to libiconv through primitives, so that we can clean out combining characters through
        normalisation. And we’ll obviously get nice sorting properties and speed for free. A rough sketch of such
        a binding follows this list.)
- Sorting and comparison.
        Basically the same point as above: libuconv / libiconv provide algorithms for this. Do we need our own implementation?
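
A minimal sketch, assuming Pharo's UFFI rather than a VM plugin, of what
a direct binding to libiconv could look like. The class, the selector,
and the library name are assumptions; iconv_open(3) is the real C entry
point, and iconv's 'UTF-8-MAC' encoding name is one way to ask for
decomposed UTF-8:

LibIconv >> ffiLibraryName
	"The library name is platform-specific; this value is an assumption."
	^ 'libiconv.2.dylib'

LibIconv >> openFrom: fromCode to: toCode
	"Bind iconv_open(3): answer an opaque conversion descriptor for
	converting text in fromCode into toCode, e.g. from 'UTF-8-MAC'
	(decomposed) to 'UTF-8' (composed)."
	^ self ffiCall: #( void* iconv_open (String toCode, String fromCode) )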

Cheers,
Max



Re: Unicode Support

Sven Van Caekenberghe-2


These 2 are indeed missing and it would be good to add them.

We already have UTF8/UTF16 encoding/decoding, even 2 implementations. See http://files.pharo.org/books/enterprisepharo/book/Zinc-Encoding-Meta/Zinc-Encoding-Meta.html for the modern version.
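
For example, a quick round-trip with the Zinc encoders already in the image:

| bytes |
bytes := ZnUTF8Encoder new encodeString: 'Les élèves français'.   "a ByteArray of UTF-8 bytes"
ZnUTF8Encoder new decodeBytes: bytes.   "=> 'Les élèves français', decoded back without loss"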

But IMHO it would not be a good idea to try to implement functionality on in-image strings with those representations; it would be too slow.

But of course, if you want to try to implement something and show us, go for it.


Re: Unicode Support

stepharo
In reply to this post by EuanM
Hi EuanM

On 4/12/15 12:42, EuanM wrote:
> I'm currently groping my way to seeing how feature-complete our
> Unicode support is.  I am doing this to establish what still needs to
> be done to provide full Unicode support.

This is great.  Thanks for pushing this.  I wrote and collected some
roadmaps (analyses of different topics) on the Pharo GitHub project;
feel free to add this one there.
>
> This seems to me to be an area where it would be best to write it
> once, and then have the same codebase incorporated into the Smalltalks
> that most share a common ancestry.
>
> I am keen to get: equality-testing for strings; sortability for
> strings which have ligatures and diacritic characters; and correct
> round-tripping of data.
Go!
My suggestion is
     start small
     make steady progress
     write tests
     commit often :)

Stef

What is the French phone-book ordering? This is the first time I have
heard of it.



Re: Unicode Support

tblanchard
In reply to this post by stepharo
I would suggest that the only worthwhile encoding is UTF8; the rest are distractions, except for being able to read and convert from other encodings to UTF8.  UTF16 is a complete waste of time.

Read http://utf8everywhere.org/

I have extensive Unicode chops from around 1999 to 2004 and my experience leads me to strongly agree with the views on that site.




Re: Unicode Support

stepharo
Hi Todd

Thanks for the link.
It looks really interesting.

Stef



Re: [squeak-dev] Unicode Support

Hannes Hirzel
In reply to this post by EuanM
On 12/6/15, Levente Uzonyi <[hidden email]> wrote:

> On Sat, 5 Dec 2015, Colin Putney wrote:
>
>> First, what's UTF-32? Second, we have the whole language tag thing that
>> nobody else uses.
>
> In Squeak, Strings use UTF-32 encoding[1]. It's straightforward
> to see for WideString, but ByteString is just a subset of WideString, so
> it uses the same encoding. We also use language tags, but that's a
> different story.
> Language tags make it possible to work around the problems introduced by
> the Han unification[2]. We shouldn't really use them for non-CJKV
> languages.
>
>>
>> Finally, UTF-8 is a great encoding that certain kinds of applications
>> really ought to use. Web apps, in particular, benefit from using UTF-8 so
>> the don't have to decode and then re-encode strings coming in from the
>> network. In DabbleDB we used UTF-8 encoded string in the image, and just
>> ignored the fact that they were displayed incorrectly by inspectors.
>> Having a proper UTF-8 string class would be useful.
>
> We do the same thing, but that doesn't mean it's a good idea to create a
> new String-like class having its content encoded in UTF-8, because
> UTF-8-encoded strings can't be modified like regular strings. While it
> would be possible to implement all operations, such implementation would
> become the next SortedCollection (bad performance due to misuse).


This is not the case if you go for ropes

https://github.com/KenDickey/Cuis-Smalltalk-Ropes
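
To make the quoted UTF-32 point concrete, a workspace sketch:

| w |
w := WideString with: (Character value: 16r1F600).   "U+1F600, outside the Basic Multilingual Plane"
w size.          "=> 1: one element per code point, not per byte or UTF-16 unit"
w first value.   "=> 128512: the full code point, stored in 32 bits"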

>
> Levente
>
> [1] https://en.wikipedia.org/wiki/UTF-32
> [2] https://en.wikipedia.org/wiki/Han_unification
>
>>
>> - Colin
>>
>>
>>> On Dec 4, 2015, at 6:46 AM, Levente Uzonyi <[hidden email]> wrote:
>>>
>>> Why would you want to have strings with UTF-8 or UTF-16 encoding in the
>>> image?
>>> What's wrong with the current UTF-32 representation?
>>>
>>> Levente

Reply | Threaded
Open this post in threaded view
|

Re: [squeak-dev] Unicode Support

tblanchard
(Resent because of bounce notification (email handling in OS X is really beginning to annoy me).  Sorry if it's a dup)

I used to worry a lot about strings being indexable.  And then I eventually let go of that and realized that it isn't a particularly important property for them to have.

I think you will find that UTF8 is generally the most convenient for a lot of things, but it's a bit like light in that you treat it alternately as a wave or particle depending on what you are trying to do.

So it goes with strings - they can be treated alternately as streams or byte arrays (not character arrays - stop thinking in characters).  In practice, this tends not to be a problem, since a lot of the time when you want to replace a character or pick out the nth one you are doing something very computerish, and the characters you are working with are the single byte (ASCII legacy) variety.  You generally know when you can get away with that and when you can't.

Otherwise you are most likely doing things that are best dealt with in a streaming paradigm.  For most computation, you come to realize you don't generally care how many characters, but how much space (bytes), you need to store your chunk of text.  Collation is tricky and complicated in Unicode in general, but it isn't any worse in UTF8 than in any other encoding.  You are still going to scan each sortable item from front to back to determine its order, regardless.

Most of the outside world has settled on UTF8 and any ASCII file is already UTF8 - which is why it ends up being so convenient.  Most of our old text handling infrastructure can still handle UTF8 while it tends to choke on wider encodings.
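To make that concrete, here is a minimal round-trip sketch, assuming a Pharo-style image with Zinc's ZnUTF8Encoder at hand (adjust the selectors for your dialect):

| bytes |
"Five characters become six bytes: the é takes two, the ASCII characters one each."
bytes := ZnUTF8Encoder new encodeString: 'héllo'.
bytes size.                            "6"
ZnUTF8Encoder new decodeBytes: bytes.  "'héllo' - a lossless round trip"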

-Todd Blanchard 

On Dec 6, 2015, at 07:23, H. Hirzel <[hidden email]> wrote:

> We do the same thing, but that doesn't mean it's a good idea to create a
> new String-like class having its content encoded in UTF-8, because
> UTF-8-encoded strings can't be modified like regular strings. While it
> would be possible to implement all operations, such implementation would
> become the next SortedCollection (bad performance due to misuse).
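A small workspace sketch of why such in-place modification breaks down, again assuming Zinc's ZnUTF8Encoder for the byte views:

| plain accented |
plain := ZnUTF8Encoder new encodeString: 'cafe'.     "4 bytes"
accented := ZnUTF8Encoder new encodeString: 'café'.  "5 bytes - é encodes as C3 A9 hex"
"Swapping $e for $é can never be a simple at:put: on the encoded bytes;
every byte after the replacement has to shift, i.e. O(n) work per mutation."
accented size - plain size.   "1"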

Reply | Threaded
Open this post in threaded view
|

Re: Unicode Support

Sven Van Caekenberghe-2
In reply to this post by tblanchard

> On 05 Dec 2015, at 17:35, Todd Blanchard <[hidden email]> wrote:
>
> I would suggest that the only worthwhile encoding is UTF8 - the rest are distractions except for being able to read and convert from other encodings to UTF8. UTF16 is a complete waste of time.
>
> Read http://utf8everywhere.org/
>
> I have extensive Unicode chops from around 1999 to 2004 and my experience leads me to strongly agree with the views on that site.

Well, I read the page/document/site as well. It was very interesting indeed, thanks for sharing it.

In some sense it made me reconsider my aversion against in-image utf-8 encoding, maybe it could have some value. Absolute storage is more efficient, some processing might also be more efficient, i/o conversions to/from utf-8 become a no-op. What I found nice is the suggestion that most structured parsing (XML, JSON, CSV, STON, ...) could actually ignore the encoding for a large part and just assume it's ASCII, which would/could be nice for performance. Also the fact that a lot of strings are (or should be) treated as opaque makes a lot of sense.
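That opaque-ASCII trick works because every byte of a UTF-8 multi-byte sequence is 80 hex or above, so a raw byte scan for an ASCII delimiter can never fire inside a multi-byte character. A minimal sketch with Zinc's ZnUTF8Encoder:

| bytes comma |
bytes := ZnUTF8Encoder new encodeString: 'café,restaurant'.
"44 (2C hex) is the byte for $, in both ASCII and UTF-8."
comma := bytes indexOf: 44.
ZnUTF8Encoder new decodeBytes: (bytes copyFrom: 1 to: comma - 1).   "'café'"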

What I did not like is that much of the argumentation is based on issues in the Windows world; take all that away and the document shrinks by half. I would have liked a bit more fundamental CS arguments.

Canonicalisation and sorting issues are hardly discussed.

In one place, the fact that a lot of special characters can have multiple representations is a big argument, while it is not mentioned how just treating things like a byte sequence would solve this (it doesn't AFAIU). Like how do you search for $e or $é if you know that it is possible to represent $é as just $é and as $e + $´ ?
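For concreteness, a sketch of the two representations, spelling the codepoints out with Character value: (plain string equality here is codepoint-by-codepoint):

| precomposed decomposed |
precomposed := String with: (Character value: 16rE9).                "é as the single codepoint 00E9"
decomposed := WideString with: $e with: (Character value: 16r0301).  "e followed by combining acute 0301"
precomposed = decomposed.   "false - the equivalence is invisible to codepoint comparison"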

Sven



Reply | Threaded
Open this post in threaded view
|

Re: [squeak-dev] Unicode Support

Sven Van Caekenberghe-2
In reply to this post by tblanchard
Well written, Todd. I agree, the loss of indexing might not be that big a problem in practice. The only way to find out is to try an experiment, I guess.

Sven



Reply | Threaded
Open this post in threaded view
|

Re: [squeak-dev] Unicode Support

Sven Van Caekenberghe-2
BTW, does anyone know of any programming language that did go that way or has a library that directly implements 'storing all strings as utf-8' ?



Reply | Threaded
Open this post in threaded view
|

Re: Unicode Support

Max Leske
In reply to this post by Sven Van Caekenberghe-2

> On 06 Dec 2015, at 18:44, Sven Van Caekenberghe <[hidden email]> wrote:
>
>
> In one place, the fact that a lot of special characters can have multiple representations is a big argument, while it is not mentioned how just treating things like a byte sequence would solve this (it doesn't AFAIU). Like how do you search for $e or $é if you know that it is possible to represent $é as just $é and as $e + $´ ?

That’s what normalization is for: http://unicode.org/faq/normalization.html. It will generate the same codepoint sequence for two strings where one contains the combining character pair and the other the precomposed “single character”.
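A toy sketch of the idea, with one hard-coded composition pair standing in for the full Unicode tables a real normalizer would use:

| normalize decomposed |
"Fold the pair e + combining acute (0301 hex) into the precomposed é (00E9 hex)."
normalize := [ :aString |
    String streamContents: [ :out |
        | in c |
        in := aString readStream.
        [ in atEnd ] whileFalse: [
            c := in next.
            (c = $e and: [ in atEnd not and: [ in peek = (Character value: 16r0301) ] ])
                ifTrue: [ in next. out nextPut: (Character value: 16rE9) ]
                ifFalse: [ out nextPut: c ] ] ] ].
decomposed := WideString with: $e with: (Character value: 16r0301).
(normalize value: decomposed) = 'é'.   "true - both spellings now compare equal"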



Reply | Threaded
Open this post in threaded view
|

Re: [squeak-dev] Unicode Support

bestlem
In reply to this post by Sven Van Caekenberghe-2
On 06/12/2015 19:08, Sven Van Caekenberghe wrote:
> BTW, does anyone know of any programming language that did go that way or has a library that directly implements 'storing all strings as utf-8' ?
Java is UTF-16

Python3, Go and Swift are UTF-8 as I suspect are other new languages not
based on .Net or the JVM

Mark



--
Mark


Reply | Threaded
Open this post in threaded view
|

Re: [squeak-dev] Unicode Support

Sven Van Caekenberghe-2
Ah yes, Go is a good example, thanks Mark.

After reading these two blog articles:

 http://blog.golang.org/strings
 http://blog.golang.org/normalization

And especially after looking through their libraries/APIs, the conclusion is: this is not simple.

I am also not so sure they managed to offer their users an API that makes it easy to avoid mistakes (given that they leave a lot of things 'open').

Note that they also say that "In practice, 99.98% of the web's HTML page content is in NFC form (not counting markup, in which case it would be more)." I must say that I have never come across anything else myself, let alone something that gave a problem, but that probably depends on the situation.

@Max: you seem to suggest that you do see non-normalised Unicode - where does it come from, how does it happen?

Sven



Reply | Threaded
Open this post in threaded view
|

Re: [squeak-dev] Unicode Support

gcotelli
In reply to this post by bestlem

As far as I know, Dart also uses UTF-16 for Strings



Reply | Threaded
Open this post in threaded view
|

Re: Unicode Support

EuanM
In reply to this post by stepharo
This is a long email.  A *lot* of it is encapsulated in the Venn diagram at both:
http://smalltalk.uk.to/unicode-utf8.html
and my Smalltalk in Small Steps blog at:
http://smalltalkinsmallsteps.blogspot.co.uk/2015/12/utf-8-for-cuis-pharo-and-squeak.html

My current thinking, and understanding.
==============================

0) a) ASCII and ISO-8859-1 consist of characters encoded in 1 byte.
    b) UTF-8 encodes the ASCII subset of those characters in 1 byte
each.  Characters above 7F hex - including the upper half of
ISO-8859-1 - are encoded as sequences of multiple bytes, as are all
further Unicode characters.

1) Smalltalk has long had multiple String classes.

2) Any Unicode codepoint in the range 0000 hex to 007F hex is encoded
    in UTF-8 as the single byte nn hex; codepoints 0080 hex to 00FF
    hex take two bytes in UTF-8.

3) All valid ISO-8859-1 characters have a character code between 20
hex and 7E hex, or between A0 hex and FF hex.
https://en.wikipedia.org/wiki/ISO/IEC_8859-1

4) All valid ASCII characters have a character code between 00 hex and 7F hex.
https://en.wikipedia.org/wiki/ASCII


5) a) All character codes which are defined within both ISO-8859-1 and
ASCII (i.e. character codes 20 hex to 7E hex) are defined identically
in both.

b) All printable ASCII characters are defined identically in both
ASCII and ISO-8859-1

6) All character codes defined in ASCII (00 hex to 7F hex) are defined
identically in Unicode, and are encoded identically, as single bytes,
in UTF-8.

7) All characters defined in ISO-8859-1 (20 hex - 7E hex ; A0 hex -
FF hex) have identical codepoints in Unicode, but those above 7F hex
are encoded in UTF-8 as two bytes, not one.

8) => some Unicode codepoints map to both ASCII and ISO-8859-1.
         all ASCII maps 1:1, byte for byte, to Unicode UTF-8
         all ISO-8859-1 maps 1:1 to Unicode codepoints, but not byte
for byte to UTF-8

9) All ByteString elements in the ASCII range are also valid UTF-8
bytes.  ByteString elements in the range 80 hex - FF hex are *not*
valid UTF-8 on their own; each must be re-encoded as a two-byte
sequence.

10) ISO-8859-1 characters representing a character with a diacritic,
or a two-character ligature, have no ASCII equivalent.  In Unicode,
the codepoints which represent such compound glyphs as single
characters are called precomposed characters (the "convenience
codepoints" of the terminology above).

11) The canonical decomposition of a character which has a precomposed
codepoint is a short sequence of codepoints: the base character,
followed by the combining characters which together form the glyph of
the convenience codepoint.


12) Some concrete examples:

A - aka Upper Case A
In ASCII, in ISO 8859-1
ASCII A - 41 hex
ISO-8859-1 A - 41 hex
UTF-8 A - 41 hex

BEL (a bell sound, often invoked by a Ctrl-g keyboard chord)
In ASCII, not in ISO 8859-1
ASCII : BEL  - 07 hex
ISO-8859-1 : 07 hex is not a valid character code
UTF-8 : BEL - 07 hex

£ (GBP currency symbol)
In ISO-8859-1, not in ASCII
ASCII : A3 hex is not a valid ASCII code
ISO-8859-1: £ - A3 hex
UTF-8: £ - codepoint A3 hex, encoded as the two bytes C2 A3 hex

Upper Case C cedilla
In ISO-8859-1, not in ASCII; in Unicode as both a precomposed
codepoint *and* a composed set of codepoints
ASCII : C7 hex is not a valid ASCII character code
ISO-8859-1 : Upper Case C cedilla - C7 hex
Unicode : Upper Case C cedilla (precomposed codepoint) - 00C7 hex,
encoded in UTF-8 as the two bytes C3 87 hex
Unicode canonical decomposition (composed set of codepoints)
   Upper case C 0043 hex (Upper case C)
       followed by
   combining cedilla 0327 hex (combining cedilla)
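Those byte sequences can be checked from a workspace, assuming Zinc's
ZnUTF8Encoder is available:

| enc |
enc := ZnUTF8Encoder new.
enc encodeString: (WideString with: (Character value: 16rC7)).
    "#[16rC3 16r87] - the precomposed form, two bytes"
enc encodeString:
    (WideString with: (Character value: 16r43) with: (Character value: 16r0327)).
    "#[16r43 16rCC 16rA7] - C followed by combining cedilla, three bytes"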

13) For any valid ASCII string *and* for any valid ISO-8859-1 string,
aByteString is completely adequate for editing and display.

14) When sorting any valid ASCII string *or* any valid ISO-8859-1
string by raw character code, upper and lower case versions of the
same character will be treated differently.

15) When sorting any valid ISO-8859-1 string containing
letter+diacritic combination glyphs or ligature combination glyphs,
the glyphs in combination will be treated differently from a "plain"
glyph of the character
i.e. "C" and "C cedilla" will be treated very differently.  "ß" and
"ss" will be treated very differently.

16) Different nations have different rules about where diacritic-ed
characters and ligature pairs should be placed in alphabetical
order.

17) Some nations even have multiple standards - e.g.  surnames
beginning either "M superscript-c" or "M superscript-a superscript-c"
are treated as beginning equivalently in UK phone directories, but not
in other situations.


Some practical upshots
==================

1) Cuis and its ISO-8859-1 encoding match UTF-8 *exactly* for any
single character in the ASCII range, and for any ByteString made up
only of such characters.  Its characters in the A0 hex - FF hex range
share their codepoints with Unicode, but their single-byte form is
not valid UTF-8.

2) A ByteString is valid UTF-8, byte for byte, in any of Squeak,
Pharo, Cuis or any other Smalltalk with a single byte ByteString
following ASCII or ISO-8859-1 - provided every element is in the
ASCII range (below 80 hex).

3) Any Smalltalk (or derivative language) using ByteString can
immediately consider its ByteStrings as valid UTF-8, as long as it
also confirms that every element of the ByteString is in the ASCII
range.

4) All of those can be successfully exported to any system using UTF-8
(e.g. HTML).

5) To successfully *accept* all UTF-8 we must be able to do either:
a) accept UTF-8 strings with composed characters
b) convert UTF-8 strings with composed characters into UTF-8 strings
that use *only* compatibility codepoints.
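A sketch of the basic byte-range check behind upshots 2) and 3)
(allSatisfy: as in Pharo; substitute your dialect's equivalent):

| isValidUtf8AsIs |
"A ByteString can be handed off as UTF-8 unchanged only when every element is below 80 hex."
isValidUtf8AsIs := [ :aByteString |
    aByteString allSatisfy: [ :each | each asInteger < 128 ] ].
isValidUtf8AsIs value: 'hello'.   "true - pure ASCII is already valid UTF-8"
isValidUtf8AsIs value: 'héllo'.   "false - the byte E9 hex must be re-encoded as C3 A9 hex"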


Class + protocol proposals



a Utf8CompatibilityString class.

   asByteString  - ensure only compatibility codepoints are used.
Ensure it does not encode characters above 00FF hex.

   asIso8859String - ensures only compatibility codepoints are used,
and that the characters are each valid ISO 8859-1

   asAsciiString - ensures only characters 00 hex - 7F hex are used.

   asUtf8ComposedIso8859String - ensures all compatibility codepoints
are expanded into small OrderedCollections of codepoints

a Utf8ComposedIso8859String class - will provide sortable and
comparable UTF8 strings of all ASCII and ISO 8859-1 strings.

Then a Utf8SortableCollection class - a collection of
Utf8ComposedIso8859Strings words and phrases.

Custom sortBlocks will define the applicable sort order.

We can create a collection...  a Dictionary, thinking about it, of
named, prefabricated sortBlocks.

This will work for all UTF8 strings derived from ISO-8859-1 and ASCII strings.
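A sketch of what one such prefabricated sortBlock might look like, with
a hypothetical hand-rolled folding table standing in for real locale
data:

| folding deKey deOrder words |
"Hypothetical folding for a deOrder block: sort ä ö ü with their base
letters, as in German dictionary order."
folding := Dictionary new.
folding at: $ä put: $a; at: $ö put: $o; at: $ü put: $u;
    at: $Ä put: $A; at: $Ö put: $O; at: $Ü put: $U.
deKey := [ :word | word collect: [ :each | folding at: each ifAbsent: [ each ] ] ].
deOrder := [ :a :b | (deKey value: a) <= (deKey value: b) ].
words := SortedCollection sortBlock: deOrder.
words add: 'Zebra'; add: 'Äpfel'; add: 'Apfel'.
words asArray.   "Äpfel now sorts with the As, ahead of Zebra"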

If anyone has better names for the classes, please let me know.

If anyone else wants to help
    - build these,
    - create SUnit tests for these
    - write documentation for these
Please let me know.

n.b. I have had absolutely no experience of Ropes.

My own background with this stuff:  In the early 90's I was a Project
Manager implementing office automation systems across a global
company, with offices in the Americas, in Western, Eastern and Central
European nations (including Slavic and Cyrillic users), and in Japan
and China.  The mission-critical application was word-processing.

Our offices were spread around the globe, and we needed those offices
to successfully exchange documents with their sister offices, and with
the customers in each region the offices were in.

Unicode was then new, and our platform supplier was the NeXT
Corporation, who had been a founder member of the Unicode Consortium
in 1990.

So far: I've read the latest version of the Unicode Standard (v8.0).
This is freely downloadable.
I've purchased a paper copy of an earlier release.  New releases
typically consist of additional codespaces (i.e. alphabets).  So old
copies are useful, as well as cheap.  (Paper copies of version 4.0
are available second-hand for < $10 / €10.)

The typical change with each release is the addition of further
codespaces (i.e. alphabets, more or less), so you don't lose a lot.
(I'll be going through my V4.0 just to make sure.)

Cheers,
   Euan





Reply | Threaded
Open this post in threaded view
|

Re: Unicode Support

EuanM
In reply to this post by stepharo
Thanks for those pointers, Steph.  I'll make sure they are on my
reading list.  (I have a limited weekly time-budget for Unicode work,
but I expect this is a long-term project).

I'll keep in touch with Steph, so any new facilities can be
immediately useful to Pharo, and someone can guide them to a proper
home in Pharo's Class hierarchy.

For now, I've stuck stuff on my blog,
http://smalltalkinsmallsteps.blogspot.co.uk/2015/12/utf-8-for-cuis-pharo-and-squeak.html
in an email here
and at smalltalk.uk.to/unicode-utf.html



Reply | Threaded
Open this post in threaded view
|

Re: Unicode Support

EuanM
In reply to this post by Sven Van Caekenberghe-2
"Canonicalisation and sorting issues are hardly discussed.

In one place, the fact that a lot of special characters can have
multiple representations is a big argument, while it is not mentioned
how just treating things like a byte sequence would solve this (it
doesn't AFAIU). Like how do you search for $e or $é if you know that
it is possible to represent $é as just $é and as $e + $´ ?"

This, for me, is one of the chief purposes of Unicode support.
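
To make that concrete, here is a minimal sketch (Pharo-style Smalltalk;
WideString is used only so that both forms live in the same class - the
codepoints are the standard Unicode ones):

| precomposed decomposed |
precomposed := WideString with: (Character value: 16r00E9).  "é as a single codepoint, U+00E9"
decomposed := WideString with: $e with: (Character value: 16r0301).  "e plus combining acute, U+0065 U+0301"
precomposed = decomposed  "-> false, although both render as é"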

What you have is a convertor from "might contain compatibility
codepoints" to "contains only composed sequences of codepoints, and no
compatibility codepoints".  As long as you're not using Strings where
you should use Streams, it should be okay.


And of course, for passing back to ISO Latin 1 or ASCII systems, you
need a convertor in the other direction, to "contains only compatibility
codepoints, and no composed sequences of codepoints".

As long as you can tell one type from the other, it's not a problem.

Any string that mixes both can be converted in either direction by the
same methods which I've just outlined.

Once you have these, we can do this for all 1-byte characters.
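
As a sketch of the shape such a convertor might take (the one-entry
mapping table is purely illustrative - a real one would be generated
from the Unicode character database, and inverting the table gives the
convertor for the opposite direction):

| toDecomposed decompose |
toDecomposed := Dictionary new.
toDecomposed
    at: (Character value: 16r00E9)
    put: (WideString with: $e with: (Character value: 16r0301)).  "é -> e + U+0301"
decompose := [ :aString |
    WideString streamContents: [ :out |
        aString do: [ :each |
            out nextPutAll: (toDecomposed at: each ifAbsent: [ WideString with: each ]) ] ] ].
decompose value: (WideString with: $c with: $a with: $f with: (Character value: 16r00E9)).
"-> the five codepoints c, a, f, e, U+0301 - ready for codepoint-wise comparison"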

We can then expand this to have Classes and methods for character
strings which contain the occasional character from other ISO
character sets.

Cheers,
    Euan


On 6 December 2015 at 17:44, Sven Van Caekenberghe <[hidden email]> wrote:

>
>> On 05 Dec 2015, at 17:35, Todd Blanchard <[hidden email]> wrote:
>>
>> I would suggest that the only worthwhile encoding is UTF8 - the rest are distractions, except for being able to read and convert from other encodings to UTF8. UTF16 is a complete waste of time.
>>
>> Read http://utf8everywhere.org/
>>
>> I have extensive Unicode chops from around 1999 to 2004 and my experience leads me to strongly agree with the views on that site.
>
> Well, I read the page/document/site as well. It was very interesting indeed, thanks for sharing it.
>
> In some sense it made me reconsider my aversion against in-image utf-8 encoding, maybe it could have some value. Absolute storage is more efficient, some processing might also be more efficient, i/o conversions to/from utf-8 become a no-op. What I found nice is the suggestion that most structured parsing (XML, JSON, CSV, STON, ...) could actually ignore the encoding for the most part and just assume it's ASCII, which would/could be nice for performance. Also the fact that a lot of strings are (or should be) treated as opaque makes a lot of sense.
>
> What I did not like is that much of the argumentation is based on issues in the Windows world; take all that away and the document shrinks by half. I would have liked a bit more fundamental CS arguments.
>
> Canonicalisation and sorting issues are hardly discussed.
>
> In one place, the fact that a lot of special characters can have multiple representations is a big argument, while it is not mentioned how just treating things like a byte sequence would solve this (it doesn't AFAIU). Like how do you search for $e or $é if you know that it is possible to represent $é as just $é and as $e + $´ ?
>
> Sven
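
To illustrate Sven's parsing point: in UTF-8, every byte of a multi-byte
sequence has its high bit set, so a scanner can look for ASCII delimiters
byte by byte without decoding anything. A minimal sketch (utf8Encoded is
Zinc's String-to-ByteArray conversion; the comma scan is purely
illustrative):

| bytes boundaries |
bytes := 'aä,βγ,c' utf8Encoded.  "ByteArray; ä, β and γ become multi-byte sequences"
boundaries := OrderedCollection new.
bytes doWithIndex: [ :byte :i |
    byte = $, asInteger ifTrue: [ boundaries add: i ] ].
boundaries  "-> the byte positions of the commas, found without decoding"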

Reply | Threaded
Open this post in threaded view
|

Re: Unicode Support

EuanM
In reply to this post by stepharo
Steph - I'll dig out the Fr phone book ordering from wherever it was
I read about it!

I thought I had it to hand, but I haven't found it tonight. It can't
be far away.
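
In the meantime, a sketch of what one of the pre-prepared sortBlocks
could look like. The folding table is a deliberately tiny stand-in for a
real collation table (German dictionary order treats ä/ö/ü roughly as
a/o/u; the phone book variants differ):

| foldMap fold deOrder words |
foldMap := Dictionary new.
foldMap at: $ä put: $a; at: $ö put: $o; at: $ü put: $u.
fold := [ :s | s asLowercase collect: [ :ch | foldMap at: ch ifAbsent: [ ch ] ] ].
deOrder := [ :a :b | (fold value: a) <= (fold value: b) ].
words := SortedCollection sortBlock: deOrder.
words add: 'Müller'; add: 'Mahler'; add: 'Mueller'.
words asArray  "-> #('Mahler' 'Mueller' 'Müller')"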

On 5 December 2015 at 13:08, stepharo <[hidden email]> wrote:

> Hi EuanM
>
> What is the French phone book ordering? This is the first time I have
> heard of it.
