Re: [Unicode] collation sequences (Re: [squeak-dev] Unicode Support)

EuanM
Hannes,

The Unicode standard provides compatibility codepoints for
compatibility purposes, and prefers all characters to be represented in
composed form - as that way they are comparable and sortable.

(Some composed characters have *more than one* compatibility codepoint.

The canonical example is the composed character #(0041 030a), which can
be represented by EITHER the compatibility codepoint #(00c5) "Latin
Capital Letter A with Ring Above" OR by #(212b) "Angstrom sign".)
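
The three encodings above can be checked directly. What follows is a minimal sketch in Python rather than Smalltalk, chosen only because its standard `unicodedata` module exposes the normalization forms defined by the standard; the variable names are illustrative and not part of any proposed Squeak/Pharo API. It shows that the three sequences differ as raw codepoints but collapse to a single sequence once normalized.

```python
import unicodedata

# The three encodings of "A with ring above" discussed above.
precomposed = "\u00c5"        # LATIN CAPITAL LETTER A WITH RING ABOVE
angstrom = "\u212b"           # ANGSTROM SIGN
sequence = "\u0041\u030a"     # LATIN CAPITAL LETTER A + COMBINING RING ABOVE

forms = [precomposed, angstrom, sequence]

# Compared codepoint by codepoint, no two of them are equal.
raw_distinct = len(set(forms))

# After normalization all three collapse to one representation:
# NFD yields the two-codepoint sequence, NFC yields U+00C5.
nfd_forms = {unicodedata.normalize("NFD", s) for s in forms}
nfc_forms = {unicodedata.normalize("NFC", s) for s in forms}

print(raw_distinct, nfd_forms == {sequence}, nfc_forms == {precomposed})
```

Any equality test or sort that works on raw codepoints will therefore treat these as three different strings unless one normalization form is applied first.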

On 7 December 2015 at 08:17, H. Hirzel <[hidden email]> wrote:

> On 12/7/15, EuanM <[hidden email]> wrote:
>> My current thinking for collation sequences:
>>
>> All strings being collated have had all compatibility codepoints
>> expanded into composed sequences.
>
> What does the Unicode manual suggest?  (www.unicode.org reference?)
>
>
>
>>
>> Strings containing composed sequences and UTF-8 strings containing
>> multi-byte characters have these represented by a very short ordered
>> collection in place of the single Byte of a ByteString.
>>
>> When we sort characters, words or phrases of strings that contain
>> zero compatibility codepoints, we simply pull a pre-defined sortBlock
>> out of a Dictionary of pre-defined sortBlocks
>>
>>
>> aDictionaryOfSorts at: ukPhoneBook put: aSortBlock
>> or
>> aDictionaryOfSorts at: ukPhoneBook put: '[aString representing the
>> code of a sortBlock]' .
>>
>> aSortedCollectionOfUtf8Strings sortBlock: (aDictionaryOfSorts at:
>> ukPhoneBook)
>>
>>  - or some actual working code!  :-)
>>
>
> Yes, focusing on this is a real need.
>
>
>>
>> On 6 December 2015 at 15:14, H. Hirzel <[hidden email]> wrote:
>>> P.S. The 30-bit value for each character in Squeak/Pharo (if necessary
>>> together with an additional language tag) is a potentially very
>>> capable infrastructure. Not really used much at the moment.
>>>
>>> The challenge is to make _existing_ Unicode-know-how defined
>>> elsewhere (e.g. www.unicode.org) available in Squeak/Pharo/Cuis.
>>>
>>> Most simple cases would be to start with collation sequences in
>>> Italian, French, German, Spanish, Portuguese. Later on more complex
>>> cases like Arabic.
>>>
>>> --HH
>>>
>>> On 12/6/15, H. Hirzel <[hidden email]> wrote:
>>>> Hi Euan,
>>>>
>>>> On 12/4/15, EuanM <[hidden email]> wrote:
>>>>> I'm currently groping my way to seeing how feature-complete our
>>>>> Unicode support is.  I am doing this to establish what still needs to
>>>>> be done to provide full Unicode support.
>>>>>
>>>>> This seems to me to be an area where it would be best to write it
>>>>> once, and then have the same codebase incorporated into the Smalltalks
>>>>> that most share a common ancestry.
>>>>>
>>>>> I am keen to get: equality-testing for strings; sortability for
>>>>> strings which have ligatures and diacritic characters; and correct
>>>>> round-tripping of data.
>>>>
>>>> These goals call for a package with SUnit tests which you then can
>>>> run on all platforms. This will be a tool to evaluate platforms for
>>>> the level of Unicode support.
>>>> As mentioned in the thread I would focus on UTF8 only as far as
>>>> external files are concerned.
>>>> I.e. the test package writes a sample UTF8 file and then reads it to
>>>> do the various tests.
>>>> I have started doing this for Squeak and Cuis some time ago with a few
>>>> tests.
>>>>
>>>> I am interested in sortability. Round-tripping is fine if you go for
>>>> UTF8.
>>>> Important of course is which languages you think the package should
>>>> work for. Some of them are easy, some not.
>>>>
>>>> This afternoon I did some updates on the Squeak wiki
>>>> http://wiki.squeak.org/squeak/recent
>>>>
>>>> --Hannes
>>>>
>>>>>
>>>>> Call to action:
>>>>> ==========
>>>>>
>>>>> If you have comments on these proposals - such as "but we already have
>>>>> that facility" or "the reason we do not have these facilities is
>>>>> because they are dog-slow" - please let me know them.
>>>>>
>>>>> If you would like to help out, please let me know.
>>>>>
>>>>> If you have Unicode experience and expertise, and would like to be, or
>>>>> would be willing to be, in the  'council of experts' for this project,
>>>>> please let me know.
>>>>>
>>>>> If you have comments or ideas on anything mentioned in this email,
>>>>> please let me know.
>>>>>
>>>>> In the first instance, the initiative's website will be:
>>>>> http://smalltalk.uk.to/unicode.html
>>>>>
>>>>> I have created a SqueakSource.com project called UnicodeSupport
>>>>>
>>>>> I want to avoid re-inventing any facilities which already exist.
>>>>> Except where they prevent us reaching the goals of:
>>>>>   - sortable UTF8 strings
>>>>>   - sortable UTF16 strings
>>>>>   - equivalence testing of 2 UTF8 strings
>>>>>   - equivalence testing of 2 UTF16 strings
>>>>>   - round-tripping UTF8 strings through Smalltalk
>>>>>   - round-tripping UTF16 strings through Smalltalk.
>>>>> As I understand it, we have limited Unicode support atm.
>>>>>
>>>>> Current state of play
>>>>> ===============
>>>>> ByteString gets converted to WideString when need is automagically
>>>>> detected.
>>>>>
>>>>> Is there anything else that currently exists?
>>>>>
>>>>> Definition of Terms
>>>>> ==============
>>>>> A quick definition of terms before I go any further:
>>>>>
>>>>> Standard terms from the Unicode standard
>>>>> ===============================
>>>>> a compatibility character : an additional encoding of a *normal*
>>>>> character, for compatibility and round-trip conversion purposes.  For
>>>>> instance, a 1-byte encoding of a Latin character with a diacritic.
>>>>>
>>>>> Made-up terms
>>>>> ============
>>>>> a convenience codepoint :  a single codepoint which represents an item
>>>>> that is also encoded as a string of codepoints.
>>>>>
>>>>> (I tend to use the terms compatibility character and compatibility
>>>>> codepoint interchangeably.  The standard only refers to them as
>>>>> compatibility characters.  However, the standard is determined to
>>>>> emphasise that characters are abstract and that codepoints are
>>>>> concrete.  So I think it is often more useful and productive to think
>>>>> of compatibility or convenience codepoints).
>>>>>
>>>>> a composed character :  a character made up of several codepoints
>>>>>
>>>>> Unicode encoding explained
>>>>> =====================
>>>>> A convenience codepoint can therefore be thought of as a code point
>>>>> used for a character which also has a composed form.
>>>>>
>>>>> The way Unicode works is that sometimes you can encode a character in
>>>>> one byte, sometimes not.  Sometimes you can encode it in two bytes,
>>>>> sometimes not.
>>>>>
>>>>> You can therefore have a long stream of ASCII which is single-byte
>>>>> Unicode.  If there is an occasional Cyrillic or Greek character in the
>>>>> stream, it would be represented either by a compatibility character or
>>>>> by a multi-byte combination.
>>>>>
>>>>> Using compatibility characters can prevent proper sorting and
>>>>> equivalence testing.
>>>>>
>>>>> Using "pure" Unicode, ie. "normal encodings", can cause compatibility
>>>>> and round-tripping problems.  Although avoiding them can *also* cause
>>>>> compatibility issues and round-tripping problems.
>>>>>
>>>>> Currently my thinking is:
>>>>>
>>>>> a Utf8String class
>>>>> an Ordered collection, with 1 byte characters as the modal element,
>>>>> but short arrays of wider strings where necessary
>>>>> a Utf16String class
>>>>> an Ordered collection, with 2 byte characters as the modal element,
>>>>> but short arrays of wider strings
>>>>> beginning with a 2-byte endianness indicator.
>>>>>
>>>>> Utf8Strings sometimes need to be sortable, and sometimes need to be
>>>>> compatible.
>>>>>
>>>>> So my thinking is that Utf8String will contain convenience codepoints,
>>>>> for round-tripping.  And where there are multiple convenience
>>>>> codepoints for a character, that it standardises on one.
>>>>>
>>>>> And that there is a Utf8SortableString which uses *only* normal
>>>>> characters.
>>>>>
>>>>> We then need methods to convert between the two.
>>>>>
>>>>> aUtf8String asUtf8SortableString
>>>>>
>>>>> and
>>>>>
>>>>> aUtf8SortableString asUtf8String
>>>>>
>>>>>
>>>>> Sort orders are culture and context dependent - Sweden and Germany
>>>>> have different sort orders for the same diacritic-ed characters.  Some
>>>>> countries have one order in general usage, and another for specific
>>>>> usages, such as phone directories (e.g. UK and France)
>>>>>
>>>>> Similarly for Utf16 :  Utf16String and Utf16SortableString and
>>>>> conversion methods
>>>>>
>>>>> A list of sorted words would be a SortedCollection, and there could be
>>>>> pre-prepared sortBlocks for them, e.g. frPhoneBookOrder, deOrder,
>>>>> seOrder, ukOrder, etc
>>>>>
>>>>> along the lines of
>>>>> aListOfWords := SortedCollection sortBlock: deOrder
>>>>>
>>>>> If a word is either a Utf8SortableString, or a well-formed Utf8String,
>>>>> then we can perform equivalence testing on them trivially.
>>>>>
>>>>> To make sure a Utf8String is well formed, we would need to have a way
>>>>> of cleaning up any convenience codepoints which were valid, but which
>>>>> were for a character which has multiple equally-valid alternative
>>>>> convenience codepoints, and for which the string currently had the
>>>>> "wrong" convenience codepoint.  (i.e. for any character with valid
>>>>> alternative convenience codepoints, we would choose one to be in the
>>>>> well-formed Utf8String, and we would need a method for cleaning the
>>>>> alternative convenience codepoints out of the string, and replacing
>>>>> them with the chosen approved convenience codepoint.)
>>>>>
>>>>> aUtf8String cleanUtf8String
>>>>>
>>>>> With WideString, a lot of the issues disappear - except
>>>>> round-tripping (although I'm sure I have seen something recently about
>>>>> 4-byte strings that also have an additional bit.  Which would make
>>>>> some Unicode characters 5-bytes long.)
>>>>>
>>>>>
>>>>> (I'm starting to zone out now - if I've overlooked anything - obvious,
>>>>> subtle, or somewhere in between, please let me know)
>>>>>
>>>>> Cheers,
>>>>>     Euan
>>>>>
>>>>>
>>>>
>>

Re: [Pharo-dev] [Unicode] collation sequences (Re: [squeak-dev] Unicode Support)

Henrik Sperre Johansen

On 07 Dec 2015, at 12:18 , EuanM <[hidden email]> wrote:

> Hannes,
>
> The Unicode standard provides compatibility codepoints for
> compatibility purposes, and prefers all characters to be represented in
> composed form - as that way they are comparable and sortable.
>
> (Some composed characters have *more than one* compatibility codepoint.
>
> The canonical example is the composed character #(0041 030a), which can
> be represented by EITHER the compatibility codepoint #(00c5) "Latin
> Capital Letter A with Ring Above" OR by #(212b) "Angstrom sign".)
>
> On 7 December 2015 at 08:17, H. Hirzel <[hidden email]> wrote:
>> On 12/7/15, EuanM <[hidden email]> wrote:
>>> My current thinking for collation sequences:
>>>
>>> All strings being collated have had all compatibility codepoints
>>> expanded into composed sequences.
>>
>> What does the Unicode manual suggest?  (www.unicode.org reference?)


While nice, it is by no means required to keep/convert the strings to Normalization Form D, as long as the operation returns the same results as if they were (http://unicode.org/reports/tr10/#Main_Algorithm):

"Step 1. Produce a normalized form of each input string, applying S1.1.

S1.1 Use the Unicode canonical algorithm to decompose characters according to the canonical mappings. That is, put the string into Normalization Form D (see [UAX15]).

  • Conformant implementations may skip this step in certain circumstances, as long as they get the same results. For techniques that may be useful in such an approach, see Section 6.5, Avoiding Normalization."

So it's a tradeoff between whether you complicate other string operations that create/modify strings to maintain normalized data, or take the hit when collating.
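
To make the two sides of that tradeoff concrete, here is a rough sketch, written in Python only because its standard `unicodedata` module provides NFD out of the box; the helper name `nfd` and the sample words are assumptions made for illustration. It covers only step S1.1 (normalization) and then compares by plain codepoint order; it does not implement the collation weights of the full algorithm in TR10.

```python
import unicodedata

def nfd(s):
    """Return s in Normalization Form D (hypothetical helper for this sketch)."""
    return unicodedata.normalize("NFD", s)

# The same word in three encodings, plus one unrelated word.
raw = [
    "\u212bngstr\u00f6m",        # ANGSTROM SIGN, precomposed o-diaeresis
    "Zebra",
    "A\u030angstro\u0308m",      # fully decomposed
    "\u00c5ngstr\u00f6m",        # precomposed A-ring, precomposed o-diaeresis
]

# Option A: store strings as they arrive, normalize only while sorting.
sorted_on_demand = sorted(raw, key=nfd)

# Option B: normalize once when strings are created, then sort directly.
stored = [nfd(s) for s in raw]
sorted_prenormalized = sorted(stored)

# Both options produce the same ordering once viewed in NFD.
same_order = [nfd(s) for s in sorted_on_demand] == sorted_prenormalized

# Without normalization the three spellings count as different strings.
distinct_raw = len(set(raw))
distinct_normalized = len(set(stored))

print(same_order, distinct_raw, distinct_normalized)
```

Option A pays the normalization cost on every sort or comparison; option B pays it once, but every operation that creates or modifies a string has to preserve the normalized form.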

Cheers,
Henry 


