Hannes,
The Unicode standard provides compatibility codepoints for compatibility purposes, and prefers all characters to be represented in composed form, as that way they are comparable and sortable.

(Some composed characters have *more than one* compatibility codepoint. The canonical example is the composed character #(0041 030a), which can be represented EITHER by the compatibility codepoint #(00c5) "Latin Capital Letter A with Ring Above" OR by #(212b) "Angstrom Sign".)
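A quick way to see the problem in a workspace (a minimal sketch; WideString is used throughout to sidestep dialect differences in ByteString promotion, and note that plain #= compares codepoints, not canonical equivalence):

  | decomposed precomposed angstrom |
  decomposed  := WideString with: (Character value: 16r0041)
                            with: (Character value: 16r030A). "A + combining ring above"
  precomposed := WideString with: (Character value: 16r00C5). "Latin Capital Letter A with Ring Above"
  angstrom    := WideString with: (Character value: 16r212B). "Angstrom Sign"
  decomposed = precomposed.  "false"
  precomposed = angstrom.    "false - yet all three denote the same abstract character"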
On 7 December 2015 at 08:17, H. Hirzel <[hidden email]> wrote:
> On 12/7/15, EuanM <[hidden email]> wrote:
>> My current thinking for collation sequences:
>>
>> All strings being collated have had all compatibility codepoints
>> expanded into composed sequences.
>
> What does the Unicode manual suggest? (www.unicode.org reference?)
>
>> Strings containing composed sequences, and UTF-8 strings containing
>> multi-byte characters, have these represented by a very short
>> ordered collection in place of the single byte of a ByteString.
>>
>> When we sort characters, words or phrases of strings that contain
>> zero compatibility codepoints, we simply pull a pre-defined
>> sortBlock out of a Dictionary of pre-defined sortBlocks:
>>
>>   aDictionaryOfSorts at: ukPhoneBook put: aSortBlock
>>
>> or
>>
>>   aDictionaryOfSorts at: ukPhoneBook put: '[aString representing
>>     the code of a sortBlock]'.
>>
>>   aSortedCollectionOfUtf8Strings
>>     sortBlock: (aDictionaryOfSorts at: ukPhoneBook)
>>
>> - or some actual working code! :-)
>
> Yes, focusing on this is a real need.
>
>> On 6 December 2015 at 15:14, H. Hirzel <[hidden email]> wrote:
>>> P.S. The 30-bit value for each character in Squeak/Pharo (if
>>> necessary together with an additional language tag) is a
>>> potentially very capable infrastructure. It is not really used
>>> much at the moment.
>>>
>>> The challenge is to make _existing_ Unicode know-how defined
>>> elsewhere (e.g. www.unicode.org) available in Squeak/Pharo/Cuis.
>>>
>>> The simplest cases would be to start with collation sequences in
>>> Italian, French, German, Spanish and Portuguese; later on, more
>>> complex cases like Arabic.
>>>
>>> --HH
>>>
>>> On 12/6/15, H. Hirzel <[hidden email]> wrote:
>>>> Hi Euan,
>>>>
>>>> On 12/4/15, EuanM <[hidden email]> wrote:
>>>>> I'm currently groping my way to seeing how feature-complete our
>>>>> Unicode support is. I am doing this to establish what still
>>>>> needs to be done to provide full Unicode support.
>>>>>
>>>>> This seems to me to be an area where it would be best to write
>>>>> it once, and then have the same codebase incorporated into the
>>>>> Smalltalks that most share a common ancestry.
>>>>>
>>>>> I am keen to get: equality testing for strings; sortability for
>>>>> strings which have ligatures and diacritic characters; and
>>>>> correct round-tripping of data.
>>>>
>>>> These goals call for a package with SUnit tests which you can
>>>> then run on all platforms. This will be a tool to evaluate
>>>> platforms for their level of Unicode support.
>>>> As mentioned in the thread, I would focus on UTF-8 only as far
>>>> as external files are concerned, i.e. the test package writes a
>>>> sample UTF-8 file and then reads it back to do the various tests.
>>>> I started doing this for Squeak and Cuis some time ago with a
>>>> few tests.
>>>>
>>>> I am interested in sortability. Round-tripping is fine if you go
>>>> for UTF-8.
>>>> Important, of course, is which languages you think the package
>>>> should work for. Some of them are easy, some not.
>>>>
>>>> This afternoon I did some updates on the Squeak wiki
>>>> http://wiki.squeak.org/squeak/recent
>>>>
>>>> --Hannes
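The round-trip test Hannes describes might be sketched along these lines. Utf8RoundTripTest is a hypothetical TestCase subclass; utf8Encoded, utf8Decoded and the FileReference protocol are Pharo-isms, so Squeak and Cuis would substitute their own converters (e.g. UTF8TextConverter):

  Utf8RoundTripTest >> testFileRoundTrip
      "Write a sample string out as UTF-8 bytes, read the bytes
      back in, decode them, and check we get the same string."
      | sample file decoded |
      sample := 'Grüße von Ångström – здравствуйте'.
      file := 'utf8-sample.txt' asFileReference.
      file binaryWriteStreamDo: [:out | out nextPutAll: sample utf8Encoded].
      decoded := file binaryReadStreamDo: [:in | in contents utf8Decoded].
      self assert: decoded equals: sample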
>>>>> Call to action:
>>>>> ==========
>>>>>
>>>>> If you have comments on these proposals - such as "but we
>>>>> already have that facility" or "the reason we do not have these
>>>>> facilities is because they are dog-slow" - please let me know.
>>>>>
>>>>> If you would like to help out, please let me know.
>>>>>
>>>>> If you have Unicode experience and expertise, and would like to
>>>>> be, or would be willing to be, on the 'council of experts' for
>>>>> this project, please let me know.
>>>>>
>>>>> If you have comments or ideas on anything mentioned in this
>>>>> email, please let me know.
>>>>>
>>>>> In the first instance, the initiative's website will be:
>>>>> http://smalltalk.uk.to/unicode.html
>>>>>
>>>>> I have created a SqueakSource.com project called UnicodeSupport.
>>>>>
>>>>> I want to avoid re-inventing any facilities which already exist,
>>>>> except where they prevent us reaching the goals of:
>>>>> - sortable UTF-8 strings
>>>>> - sortable UTF-16 strings
>>>>> - equivalence testing of 2 UTF-8 strings
>>>>> - equivalence testing of 2 UTF-16 strings
>>>>> - round-tripping UTF-8 strings through Smalltalk
>>>>> - round-tripping UTF-16 strings through Smalltalk
>>>>>
>>>>> As I understand it, we have limited Unicode support at the
>>>>> moment.
>>>>>
>>>>> Current state of play
>>>>> ===============
>>>>> ByteString gets converted to WideString when the need is
>>>>> automagically detected.
>>>>>
>>>>> Is there anything else that currently exists?
>>>>>
>>>>> Definition of terms
>>>>> ==============
>>>>> A quick definition of terms before I go any further:
>>>>>
>>>>> Standard terms from the Unicode standard
>>>>> ===============================
>>>>> a compatibility character: an additional encoding of a *normal*
>>>>> character, for compatibility and round-trip conversion purposes.
>>>>> For instance, a 1-byte encoding of a Latin character with a
>>>>> diacritic.
>>>>>
>>>>> Made-up terms
>>>>> ============
>>>>> a convenience codepoint: a single codepoint which represents an
>>>>> item that is also encoded as a string of codepoints.
>>>>>
>>>>> (I tend to use the terms compatibility character and
>>>>> compatibility codepoint interchangeably. The standard only
>>>>> refers to them as compatibility characters. However, the
>>>>> standard is determined to emphasise that characters are abstract
>>>>> and that codepoints are concrete, so I think it is often more
>>>>> useful and productive to think of compatibility or convenience
>>>>> codepoints.)
>>>>>
>>>>> a composed character: a character made up of several codepoints
>>>>>
>>>>> Unicode encoding explained
>>>>> =====================
>>>>> A convenience codepoint can therefore be thought of as a
>>>>> codepoint used for a character which also has a composed form.
>>>>>
>>>>> The way Unicode works is that sometimes you can encode a
>>>>> character in one byte, sometimes not. Sometimes you can encode
>>>>> it in two bytes, sometimes not.
>>>>>
>>>>> You can therefore have a long stream of ASCII which is
>>>>> single-byte Unicode. If there is an occasional Cyrillic or Greek
>>>>> character in the stream, it would be represented either by a
>>>>> compatibility character or by a multi-byte combination.
>>>>>
>>>>> Using compatibility characters can prevent proper sorting and
>>>>> equivalence testing. Using "pure" Unicode, i.e. "normal
>>>>> encodings", can cause compatibility and round-tripping problems,
>>>>> although avoiding them can *also* cause compatibility issues and
>>>>> round-tripping problems.
>>>>>
>>>>> Currently my thinking is:
>>>>>
>>>>> a Utf8String class:
>>>>>   an OrderedCollection, with 1-byte characters as the modal
>>>>>   element, but short arrays of wider strings where necessary
>>>>>
>>>>> a Utf16String class:
>>>>>   an OrderedCollection, with 2-byte characters as the modal
>>>>>   element, but short arrays of wider strings where necessary,
>>>>>   beginning with a 2-byte endianness indicator.
>>>>>
>>>>> Utf8Strings sometimes need to be sortable, and sometimes need to
>>>>> be compatible.
>>>>>
>>>>> So my thinking is that Utf8String will contain convenience
>>>>> codepoints, for round-tripping, and that where there are
>>>>> multiple convenience codepoints for a character, it standardises
>>>>> on one.
>>>>>
>>>>> And that there is a Utf8SortableString which uses *only* normal
>>>>> characters.
>>>>>
>>>>> We then need methods to convert between the two:
>>>>>
>>>>>   aUtf8String asUtf8SortableString
>>>>>
>>>>> and
>>>>>
>>>>>   aUtf8SortableString asUtf8String
>>>>>
>>>>> Sort orders are culture- and context-dependent - Sweden and
>>>>> Germany have different sort orders for the same diacritic-ed
>>>>> characters, and some countries have one order in general usage
>>>>> and another for specific usages, such as phone directories
>>>>> (e.g. UK and France).
>>>>>
>>>>> Similarly for UTF-16: Utf16String and Utf16SortableString and
>>>>> conversion methods.
>>>>>
>>>>> A list of sorted words would be a SortedCollection, and there
>>>>> could be pre-prepared sortBlocks for them, e.g.
>>>>> frPhoneBookOrder, deOrder, seOrder, ukOrder, etc., along the
>>>>> lines of
>>>>>
>>>>>   aListOfWords := SortedCollection sortBlock: deOrder
>>>>>
>>>>> If a word is either a Utf8SortableString or a well-formed
>>>>> Utf8String, then we can perform equivalence testing on them
>>>>> trivially.
>>>>>
>>>>> To make sure a Utf8String is well formed, we would need a way of
>>>>> cleaning up any convenience codepoints which were valid, but
>>>>> which were for a character which has multiple equally-valid
>>>>> alternative convenience codepoints, and for which the string
>>>>> currently had the "wrong" convenience codepoint. (I.e. for any
>>>>> character with valid alternative convenience codepoints, we
>>>>> would choose one to be in the well-formed Utf8String, and we
>>>>> would need a method for cleaning the alternative convenience
>>>>> codepoints out of the string and replacing them with the chosen
>>>>> approved convenience codepoint.)
>>>>>
>>>>>   aUtf8String cleanUtf8String
>>>>>
>>>>> With WideString, a lot of the issues disappear - except
>>>>> round-tripping (although I'm sure I have seen something recently
>>>>> about 4-byte strings that also have an additional bit, which
>>>>> would make some Unicode characters 5 bytes long).
>>>>>
>>>>> (I'm starting to zone out now - if I've overlooked anything -
>>>>> obvious, subtle, or somewhere in between - please let me know.)
>>>>>
>>>>> Cheers,
>>>>> Euan
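To make the sortBlock dictionary concrete, here is a workspace-sized sketch. The German block merely folds ä to a before comparing - a toy stand-in for real UCA collation keys, not a proposal for how deOrder should actually be built:

  | sortBlocks words |
  sortBlocks := Dictionary new.
  sortBlocks
      at: #deOrder
      put: [:a :b |
          "DIN 5007-1 style: treat the umlaut as its base vowel"
          (a copyReplaceAll: 'ä' with: 'a')
              <= (b copyReplaceAll: 'ä' with: 'a')].
  words := SortedCollection sortBlock: (sortBlocks at: #deOrder).
  words addAll: #('Bär' 'Baum' 'Bauer').
  words asArray.  "#('Bär' 'Bauer' 'Baum') - the ä collates as a"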
While nice, it is by no means required to keep/convert the strings to Normalization Form D, as long as the operation returns the same results as if they were (http://unicode.org/reports/tr10/#Main_Algorithm):

"Step 1. Produce a normalized form of each input string, applying S1.1.

S1.1 Use the Unicode canonical algorithm to decompose characters according to the canonical mappings. That is, put the string into Normalization Form D (see [UAX15])."
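Step 1 in miniature, assuming a hand-rolled two-entry decomposition table (a real implementation would use the full UnicodeData canonical mappings, apply them recursively, and also do canonical reordering of combining marks):

  | decompose nfd aRing |
  aRing := WideString with: (Character value: 16r0041)
                      with: (Character value: 16r030A).
  decompose := Dictionary new.
  decompose at: (Character value: 16r00C5) put: aRing.
  decompose at: (Character value: 16r212B) put: aRing.
  nfd := [:aString |
      aString inject: '' into: [:acc :each |
          acc , (decompose at: each ifAbsent: [WideString with: each])]].
  (nfd value: (WideString with: (Character value: 16r00C5)))
      = (nfd value: (WideString with: (Character value: 16r212B)))  "true"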
So it's a tradeoff: either you complicate the other string operations that create or modify strings so that they maintain normalized data, or you take the hit when collating.

Cheers,
Henry
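The lazy half of that tradeoff might look like this on the proposed Utf8String - collationForm and normalizedNFD are hypothetical names, and collationForm would need to be an instance variable of the class:

  Utf8String >> collationForm
      "Memoize the Normalization Form D rendering the first time a
      collation asks for it, leaving create/modify operations untouched."
      ^ collationForm ifNil: [collationForm := self normalizedNFD]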