On 12/13/15, stepharo <[hidden email]> wrote:
>
>> It is best and (mostly) correct to think of a Unicode string as a sequence
>> of Unicode characters, each defined/identified by a code point (out of
>> 10.000s covering all languages). That is what we have today in Pharo (with
>> the distinction between ByteString and WideString as mostly invisible
>> implementation details).
>>
>> To encode Unicode for external representation as bytes, we use UTF-8 like
>> the rest of the modern world.
>> So far, so good.
>> ...
>
>
> like me ;)
> I will wait for a conclusion with code :)
>
> Stef
>

Some code illustrating the e acute example follows below.

>> Why then is there confusion about the seemingly simple concept of a
>> character ? Because Unicode allows different ways to say the same thing.
>> The simplest example in a common language is (the French letter é) is
>>
>> LATIN SMALL LETTER E WITH ACUTE [U+00E9]
>>
>> which can also be written as
>>
>> LATIN SMALL LETTER E [U+0065] followed by COMBINING ACUTE ACCENT [U+0301]
>>
>> The former being a composed normal form, the latter a decomposed normal
>> form. (And yes, it is even much more complicated than that, it goes on for
>> 1000s of pages).
>>
>> In the above example, the concept of character/string is indeed fuzzy.
>>
>> HTH,
>>
>> Sven

The text below shows how to handle the Unicode e acute example brought up by Sven when comparing strings. Pharo and Cuis currently do not normalize strings; Squeak has limited support. It will be shown how NFD normalization may be implemented.


Swift programming language
-----------------------------------------

How does the Swift programming language [1] deal with Unicode strings?

// "Voulez-vous un café?" using LATIN SMALL LETTER E WITH ACUTE
let eAcuteQuestion = "Voulez-vous un caf\u{E9}?"

// "Voulez-vous un café?" using LATIN SMALL LETTER E and COMBINING ACUTE ACCENT
let combinedEAcuteQuestion = "Voulez-vous un caf\u{65}\u{301}?"
if eAcuteQuestion == combinedEAcuteQuestion {
    print("These two strings are considered equal")
}
// prints "These two strings are considered equal"

The equality operator treats canonically equivalent strings as equal, which corresponds to comparing their NFD (Normalization Form Decomposed) [2] forms. NSString offers this normalization as the method #decomposedStringWithCanonicalMapping [3].


Squeak / Pharo
-----------------------

Comparison without NFD [3]

"Voulez-vous un café?"
eAcuteQuestion := 'Voulez-vous un caf', 16rE9 asCharacter asString, '?'.
combinedEAcuteQuestion := 'Voulez-vous un cafe', 16r301 asCharacter asString, '?'.

eAcuteQuestion = combinedEAcuteQuestion
 false

eAcuteQuestion == combinedEAcuteQuestion
 false

The result is false. A Unicode conformant application, however, should return *true* for the = test.

The reason is that Squeak / Pharo strings are not put into NFD before being tested for equality with =.

Squeak Unicode strings may be tested for Unicode conformant equality by converting them to NFD before testing.


Squeak using NFD

#asDecomposedUnicode [4] transforms a string into NFD, but only for cases where a Unicode code point [5], if it is decomposed, is decomposed into two code points. The limitation comes from the code that reads UnicodeData.txt [7][8] when the Unicode Character Database is initialized in Squeak [6]. It is not a necessary limitation: the code may be rewritten, at the price of a more complex implementation of #asDecomposedUnicode.

"Voulez-vous un café?"
eAcuteQuestion := 'Voulez-vous un caf', 16rE9 asCharacter asString, '?'.
combinedEAcuteQuestion := 'Voulez-vous un cafe', 16r301 asCharacter asString, '?'.

eAcuteQuestion asDecomposedUnicode = combinedEAcuteQuestion asDecomposedUnicode
 true


Conclusion
------------------

Implementing a method like #decomposedStringWithCanonicalMapping (Swift), which puts a string into NFD (Normalization Form D), is an important building block towards better Unicode compliance. A Squeak proposal is given by [4]. It needs to be reviewed.
It should probably be extended to cover cases where the decomposed form has more than two code points (3 or more?).

Implementing NFD comparison gives us an equality test for comparatively little effort. It handles the simple cases, which cover a large number of uses (languages written in the Latin script).

The algorithm is table driven by the UCD [8]. From this follows a simple but important fact: conformant implementations need runtime access to information from the Unicode Character Database (UCD) [9].


[1] https://developer.apple.com/library/ios/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html#//apple_ref/doc/uid/TP40014097-CH7-ID285
[2] http://www.unicode.org/glossary/#normalization_form_d
[3] https://developer.apple.com/library/ios/documentation/Cocoa/Reference/Foundation/Classes/NSString_Class/index.html#//apple_ref/occ/instm/NSString/decomposedStringWithCanonicalMapping
[4] String asDecomposedUnicode http://wiki.squeak.org/squeak/6250
[5] http://www.unicode.org/glossary/#code_point
[6] Unicode initialize http://wiki.squeak.org/squeak/6248
[7] http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
[8] Unicode Character Database documentation http://unicode.org/reports/tr44/
[9] http://www.unicode.org/reports/tr23/
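
P.S. For anyone who wants to cross-check the expected results outside Smalltalk, here is a minimal sketch in Python. Python is not part of this thread. It is used only because its standard unicodedata module ships the UCD tables. The helper name nfd_equal is made up for the illustration, and the U+1EA5 example (LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE) is my own choice of a character whose full decomposition has three code points, i.e. the kind of case the conclusion says still needs to be covered.

```python
import unicodedata


def nfd_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after putting both into NFD."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# "Voulez-vous un café?" with LATIN SMALL LETTER E WITH ACUTE (U+00E9)
e_acute_question = "Voulez-vous un caf\u00e9?"
# Same text with LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT (U+0301)
combined_e_acute_question = "Voulez-vous un cafe\u0301?"

# Code point by code point comparison, like = in Squeak / Pharo today
plain_equal = e_acute_question == combined_e_acute_question

# Comparison after NFD normalization of both sides
canonical_equal = nfd_equal(e_acute_question, combined_e_acute_question)

# Full decomposition of U+1EA5: U+0061, U+0302, U+0301 (three code points)
three_part = [hex(ord(c)) for c in unicodedata.normalize("NFD", "\u1ea5")]

print(plain_equal)      # False
print(canonical_equal)  # True
print(three_part)       # ['0x61', '0x302', '0x301']
```

The three-code-point result comes from applying the two-step mapping (U+1EA5 to U+00E2 U+0301, then U+00E2 to U+0061 U+0302) repeatedly. An extended #asDecomposedUnicode would presumably need to do something similar, though I have not checked this against the Squeak code in [4].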