Re: [Pharo-dev] Unicode Support - Normalization NFD -- #decomposedStringWithCanonicalMapping


Hannes Hirzel
On 12/13/15, stepharo <[hidden email]> wrote:
>
>> It is best and (mostly) correct to think of a Unicode string as a sequence
>> of Unicode characters, each defined/identified by a code point (out of the
>> tens of thousands covering all languages). That is what we have today in
>> Pharo (with the distinction between ByteString and WideString as mostly
>> invisible implementation details).
>>
>> To encode Unicode for external representation as bytes, we use UTF-8 like
>> the rest of the modern world.

>> So far, so good.
>>
...>
>
> like me ;)
> I will wait for a conclusion with code :)
>
> Stef
>

Some code illustrating the e-acute example follows below.

>> Why then is there confusion about the seemingly simple concept of a
>> character ? Because Unicode allows different ways to say the same thing.
>> The simplest example in a common language (the French letter é) is
>>
>> LATIN SMALL LETTER E WITH ACUTE [U+00E9]
>>
>> which can also be written as
>>
>> LATIN SMALL LETTER E [U+0065] followed by COMBINING ACUTE ACCENT [U+0301]
>>
>> The former being a composed normal form, the latter a decomposed normal
>> form. (And yes, it is even much more complicated than that, it goes on for
>> 1000s of pages).
>>
>> In the above example, the concept of character/string is indeed fuzzy.
>>
>> HTH,
>>
>> Sven

The text below shows how to deal with the Unicode e-acute example
brought up by Sven when comparing strings. Currently neither Pharo nor
Cuis normalizes strings; Squeak has limited support. It will be shown
how NFD normalization may be implemented.


Swift programming language
-----------------------------------------

How does the Swift programming language [1] deal with Unicode strings?

 // "Voulez-vous un café?" using LATIN SMALL LETTER E WITH ACUTE
    let eAcuteQuestion = "Voulez-vous un caf\u{E9}?"

    // "Voulez-vous un cafe&#769;?" using LATIN SMALL LETTER E and
COMBINING ACUTE ACCENT
    let combinedEAcuteQuestion = "Voulez-vous un caf\u{65}\u{301}?"

    if eAcuteQuestion == combinedEAcuteQuestion {
    print("These two strings are considered equal")
    }
    // prints "These two strings are considered equal"

The equality operator uses the NFD (Normalization Form Decomposed) [2]
form of the two strings for the comparison, applying a method like
#decomposedStringWithCanonicalMapping [3].
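The same behaviour can be illustrated outside Swift; as a minimal
sketch, Python's standard unicodedata module exposes the same NFD
normalization:

```python
import unicodedata

# Precomposed: LATIN SMALL LETTER E WITH ACUTE (U+00E9)
e_acute_question = "Voulez-vous un caf\u00E9?"
# Decomposed: LATIN SMALL LETTER E (U+0065) + COMBINING ACUTE ACCENT (U+0301)
combined_e_acute_question = "Voulez-vous un cafe\u0301?"

# Plain == compares code point sequences, so the strings differ ...
print(e_acute_question == combined_e_acute_question)   # False

# ... but after NFD normalization they compare equal.
def nfd(s):
    return unicodedata.normalize("NFD", s)

print(nfd(e_acute_question) == nfd(combined_e_acute_question))  # True
```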


Squeak / Pharo
-----------------------

Comparison without NFD [3]


"Voulez-vous un café?"
eAcuteQuestion  := 'Voulez-vous un caf', 16rE9 asCharacter asString, '?'.
combinedEAcuteQuestion := 'Voulez-vous un cafe', 16r301 asCharacter
asString, '?'.


eAcuteQuestion = combinedEAcuteQuestion
 false

eAcuteQuestion == combinedEAcuteQuestion
 false

Both comparisons answer false. A Unicode-conformant application,
however, should answer *true* for the equality test. The reason is
that Squeak / Pharo strings are not put into NFD before being tested
for equality with =.


Squeak Unicode strings may be tested for Unicode conformant equality
by converting them to NFD before testing.



Squeak using NFD
---------------------------

#asDecomposedUnicode [4] transforms a string into NFD for the cases
where a Unicode code point, if decomposed, decomposes into no more
than two code points [5]. This limitation is imposed by the code that
reads UnicodeData.txt [7][8] when initializing [6] the Unicode
Character Database in Squeak. It is not a necessary limitation: that
code may be rewritten, at the price of a more complex implementation
of #asDecomposedUnicode.
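As a sketch of how the two-code-point restriction could be lifted, the
following Python reads canonical decomposition mappings (field 5 of
UnicodeData.txt; entries tagged with <...> are compatibility mappings
and do not take part in NFD) and expands them recursively, so
decompositions of any length are handled. The sample lines are
abbreviated excerpts of the real file:

```python
# Build a canonical-decomposition table from UnicodeData.txt-style lines
# and apply it recursively, so decompositions of any length are handled.

def parse_decompositions(lines):
    table = {}
    for line in lines:
        fields = line.split(";")
        cp, mapping = int(fields[0], 16), fields[5]
        # Skip empty mappings and <tagged> compatibility mappings:
        # only canonical decompositions take part in NFD.
        if mapping and not mapping.startswith("<"):
            table[cp] = [int(part, 16) for part in mapping.split()]
    return table

def decompose(cp, table):
    """Recursively expand a code point to its full canonical decomposition."""
    if cp not in table:
        return [cp]
    result = []
    for part in table[cp]:
        result.extend(decompose(part, table))
    return result

# Abbreviated sample lines (only fields 0 and 5 matter here):
sample = [
    "00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301",
    "00E7;LATIN SMALL LETTER C WITH CEDILLA;Ll;0;L;0063 0327",
    "1E09;LATIN SMALL LETTER C WITH CEDILLA AND ACUTE;Ll;0;L;00E7 0301",
]
table = parse_decompositions(sample)

# U+1E09 needs two expansion steps and yields three code points:
print([hex(c) for c in decompose(0x1E09, table)])  # ['0x63', '0x327', '0x301']
```

Note that full NFD additionally requires the Canonical Ordering
Algorithm, which sorts adjacent combining marks by their canonical
combining class (field 3 of UnicodeData.txt).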

"Voulez-vous un café?"
eAcuteQuestion  := 'Voulez-vous un caf', 16rE9 asCharacter asString, '?'.
combinedEAcuteQuestion := 'Voulez-vous un cafe', 16r301 asCharacter
asString, '?'.


eAcuteQuestion asDecomposedUnicode =
    combinedEAcuteQuestion  asDecomposedUnicode

 true



Conclusion
------------------

Implementing a method like #decomposedStringWithCanonicalMapping
(Swift), which puts a string into NFD (Normalization Form D), is an
important building block towards better Unicode compliance. A Squeak
proposal is given by [4]. It needs to be reviewed and extended.

In particular, it should be extended to handle cases where the
decomposed form contains more than two code points, and to apply the
canonical reordering of combining marks that full NFD requires.

Implementing NFD comparison gives us an equality test at a
comparatively small effort for simple cases, covering a large number
of use cases (languages using the Latin script).

The algorithm is table-driven by the UCD [8]. From this follows a
simple but important fact: conformant implementations need runtime
access to information from the Unicode Character Database (UCD) [9].
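For illustration, Python ships such runtime access in its standard
unicodedata module, which wraps a bundled copy of the UCD:

```python
import unicodedata

# Runtime lookups against the bundled Unicode Character Database:
print(unicodedata.name("\u00E9"))           # LATIN SMALL LETTER E WITH ACUTE
print(unicodedata.decomposition("\u00E9"))  # 0065 0301
print(unicodedata.combining("\u0301"))      # 230 (canonical combining class)
```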


[1] https://developer.apple.com/library/ios/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html#//apple_ref/doc/uid/TP40014097-CH7-ID285
[2] http://www.unicode.org/glossary/#normalization_form_d
[3] https://developer.apple.com/library/ios/documentation/Cocoa/Reference/Foundation/Classes/NSString_Class/index.html#//apple_ref/occ/instm/NSString/decomposedStringWithCanonicalMapping
[4] String asDecomposedUnicode http://wiki.squeak.org/squeak/6250
[5] http://www.unicode.org/glossary/#code_point
[6] Unicode initialize http://wiki.squeak.org/squeak/6248
[7] http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
[8] Unicode Character Database documentation http://unicode.org/reports/tr44/
[9] http://www.unicode.org/reports/tr23/