Hello Sven
Thank you for your report about your experimental, proof-of-concept, prototype project that aims to improve Unicode support. Please include me in the loop.

Below is my attempt at summarizing the Unicode discussion of the last weeks. Corrections / comments / additions are welcome.

Kind regards

Hannes


1) There is a need for improved Unicode support implemented _within_ the image, probably as a library.

1a) This follows the example of the Twitter CLDR library (i.e. a re-implementation of ICU components for Ruby).
https://github.com/twitter/twitter-cldr-rb

Other languages/libraries have similar approaches:
- dotNet, https://msdn.microsoft.com/en-us/library/System.Globalization.CharUnicodeInfo%28v=vs.110%29.aspx
- Python, https://docs.python.org/3/howto/unicode.html
- Go, http://blog.golang.org/strings
- Swift, https://developer.apple.com/library/ios/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html
- Perl, http://blog.golang.org/strings

1b) ICU (http://site.icu-project.org/) is _not_ the way to go. This is for security and portability reasons (Eliot Miranda) and because the Smalltalk approach wants to expose the basic algorithms in Smalltalk code. In addition, the 16-bit-based ICU library does not fit well with the Squeak/Pharo UTF-32 model.

2) The Unicode infrastructure (21(32)-bit wide Characters as immediate objects, use of UTF-32 internally, indexable strings, UTF-8 for outside communication, support for code converters) is a very valuable foundation which makes the algorithms more straightforward at the expense of more memory usage. It is not yet used to its full potential, though a lot of hard work has been done.

3) The Unicode algorithms are mostly table / database driven. This means that dictionary lookup is a prominent part of the algorithms. The essential building block for this is that the Unicode Character Database UCD (http://www.unicode.org/ucd/) is made available _within_ the image, with the full content as needed by the target languages / scripts one wants to deal with. The process of loading the UCD should be made configurable.

3a) A lot of people are interested in the Latin script (and scripts of similar complexity) only.
3b) The UCD data in XML form (http://www.unicode.org/Public/8.0.0/ucdxml/) offers a download with and without the CJK characters.

4) The next step is to implement normalization (http://www.unicode.org/reports/tr15/#Norm_Forms). Glad to read that you have reached results here with the test data: http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt.

5) Pharo offers nice inspectors to view dictionaries and ordered collections (table view, drill down), which facilitates the development of table-driven algorithms. The data structures and algorithms do not depend on a particular dialect, though, and may be ported to Squeak or Cuis.

6) After normalization has been implemented, comparison may be implemented. This needs CLDR access (collation; Unicode Common Locale Data Repository, http://cldr.unicode.org/).

7) An architecture has the following subsystems:

7a) Basic character handling (21(32)-bit characters in indexable strings, point 2)
7b) Runtime access to the Unicode Character Database (point 3)
7c) Converters
7d) Normalization (point 4)
7e) CLDR access (point 6)

8) The implementation should be driven by the current needs. An attainable next goal is to release

8a) a StringBuilder utility class for easier construction of test strings, i.e. instead of

> normalizer composeString: (#(68 117 776 115 115 101 108 100 111 114 102 32
> 75 111 776 110 105 103 115 97 108 108 101 101) collect: #asCharacter as:
> String).

do

normalizer composeString:
    (StringBuilder construct: 'Du\u0308sseldorf Ko\u0308nigsallee')

and construct some test cases with it which illustrate some basic Unicode issues (a minimal sketch of such a class follows after this summary).

8b) identity testing for major languages (e.g. French, German, Spanish) and scripts of similar complexity.

8c) some more documentation of past and concurrent efforts.

Note: This summary has only covered string manipulation, not rendering on the screen, which is a different issue.
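A minimal sketch of what such a StringBuilder could look like. The class name, the #construct: selector and the \uXXXX escape syntax are taken from the proposal above; everything else (a class-side method, exactly four hex digits per escape) is an assumption, not an agreed design:

    StringBuilder class >> construct: aString
        "Answer a String in which every \uXXXX escape (four hex digits) is
         replaced by the character with that code point; all other characters
         are copied unchanged."
        | in codePoints |
        in := aString readStream.
        codePoints := OrderedCollection new.
        [ in atEnd ] whileFalse: [ | char |
            char := in next.
            (char = $\ and: [ in peek = $u ])
                ifTrue: [
                    in next.  "skip the u"
                    codePoints add: (Integer readFrom: (in next: 4) radix: 16) ]
                ifFalse: [ codePoints add: char codePoint ] ].
        ^ codePoints collect: #asCharacter as: String

    "(StringBuilder construct: 'Du\u0308sseldorf') collect: #codePoint as: Array.
     => #(68 117 776 115 115 101 108 100 111 114 102), i.e. a decomposed ü"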
On 12/16/15, Sven Van Caekenberghe <[hidden email]> wrote:
> Hi Hannes,
>
> My detailed comments/answers below, after quoting 2 of your emails:
>
>> On 10 Dec 2015, at 22:17, H. Hirzel <[hidden email]> wrote:
>>
>> Hello Sven
>>
>> On 12/9/15, Sven Van Caekenberghe <[hidden email]> wrote:
>>
>>> The simplest example in a common language (the French letter é) is
>>>
>>> LATIN SMALL LETTER E WITH ACUTE [U+00E9]
>>>
>>> which can also be written as
>>>
>>> LATIN SMALL LETTER E [U+0065] followed by COMBINING ACUTE ACCENT [U+0301]
>>>
>>> The former being a composed normal form, the latter a decomposed normal
>>> form. (And yes, it is even much more complicated than that, it goes on
>>> for 1000s of pages.)
>>>
>>> In the above example, the concept of character/string is indeed fuzzy.
>>>
>>> HTH,
>>>
>>> Sven
>>
>> Thanks for this example. I have created a wiki page with it.
>>
>> I wonder what the Pharo equivalent is of the following Squeak expression
>>
>> $é asString asDecomposedUnicode
>>
>> Regards
>>
>> Hannes
>
> You also wrote:
>
>> The text below shows how to deal with the Unicode e acute example
>> brought up by Sven in terms of comparing strings. Currently Pharo and
>> Cuis do not do normalization of strings. Limited support is in Squeak.
>> It will be shown how NFD normalization may be implemented.
>>
>>
>> Swift programming language
>> -----------------------------------------
>>
>> How does the Swift programming language [1] deal with Unicode strings?
>>
>> // "Voulez-vous un café?" using LATIN SMALL LETTER E WITH ACUTE
>> let eAcuteQuestion = "Voulez-vous un caf\u{E9}?"
>>
>> // "Voulez-vous un café?" using LATIN SMALL LETTER E and COMBINING ACUTE ACCENT
>> let combinedEAcuteQuestion = "Voulez-vous un caf\u{65}\u{301}?"
>>
>> if eAcuteQuestion == combinedEAcuteQuestion {
>>     print("These two strings are considered equal")
>> }
>> // prints "These two strings are considered equal"
>>
>> The equality operator uses the NFD (Normalization Form D, decomposed) [2]
>> form for the comparison, applying the method
>> #decomposedStringWithCanonicalMapping [3].
>>
>>
>> Squeak / Pharo
>> -----------------------
>>
>> Comparison without NFD [3]
>>
>> "Voulez-vous un café?"
>> eAcuteQuestion := 'Voulez-vous un caf', 16rE9 asCharacter asString, '?'.
>> combinedEAcuteQuestion := 'Voulez-vous un cafe', 16r301 asCharacter asString, '?'.
>>
>> eAcuteQuestion = combinedEAcuteQuestion
>> false
>>
>> eAcuteQuestion == combinedEAcuteQuestion
>> false
>>
>> The result is false. A Unicode conformant application however should
>> return *true*.
>>
>> The reason is that Squeak / Pharo strings are not put into NFD
>> before testing for equality with =.
>>
>> Squeak Unicode strings may be tested for Unicode conformant equality
>> by converting them to NFD before testing.
>>
>> Squeak using NFD
>>
>> asDecomposedUnicode [4] transforms a string into NFD for cases where a
>> Unicode code point, if decomposed, decomposes into at most two code
>> points [5]. This is a limitation imposed by the code which reads
>> UnicodeData.txt [7][8] when initializing [6] the Unicode Character
>> Database in Squeak. It is not a necessary limitation; the code may be
>> rewritten at the price of a more complex implementation of
>> #asDecomposedUnicode.
>>
>> "Voulez-vous un café?"
>> eAcuteQuestion := 'Voulez-vous un caf', 16rE9 asCharacter asString, '?'.
>> combinedEAcuteQuestion := 'Voulez-vous un cafe', 16r301 asCharacter asString, '?'.
>>
>> eAcuteQuestion asDecomposedUnicode = combinedEAcuteQuestion asDecomposedUnicode
>> true
>>
>>
>> Conclusion
>> ------------------
>>
>> Implementing a method like #decomposedStringWithCanonicalMapping
>> (Swift) which puts a string into NFD (Normalization Form D) is an
>> important building block towards better Unicode compliance. A Squeak
>> proposal is given by [4]. It needs to be reviewed / extended.
>>
>> It should probably be extended for cases where there are more than
>> two code points in the decomposed form (3 or more?).
>>
>> Implementing NFD comparison gives us an equality test at a
>> comparatively small cost for simple cases, covering a large number of
>> use cases (languages using the Latin script).
>>
>> The algorithm is table driven by the UCD [8]. From this follows a
>> simple but important fact: conformant implementations need runtime
>> access to information from the Unicode Character Database (UCD) [9].
>>
>> [1] https://developer.apple.com/library/ios/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html#//apple_ref/doc/uid/TP40014097-CH7-ID285
>> [2] http://www.unicode.org/glossary/#normalization_form_d
>> [3] https://developer.apple.com/library/ios/documentation/Cocoa/Reference/Foundation/Classes/NSString_Class/index.html#//apple_ref/occ/instm/NSString/decomposedStringWithCanonicalMapping
>> [4] String asDecomposedUnicode http://wiki.squeak.org/squeak/6250
>> [5] http://www.unicode.org/glossary/#code_point
>> [6] Unicode initialize http://wiki.squeak.org/squeak/6248
>> [7] http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
>> [8] Unicode Character Database documentation http://unicode.org/reports/tr44/
>> [9] http://www.unicode.org/reports/tr23/
>
> Today, we have a Unicode and CombinedCharacter class in Pharo, and there is
> different but similar Unicode code in Squeak. These are too simple (even
> though they might work, partially).
>
> The scope of the original threads is way too wide: a new string type,
> normalisation, collation, being cross dialect, mixing all kinds of character
> and encoding definitions. All interesting, but not much will come out of it.
> But the point that we cannot leave proper text string handling to an outside
> library is indeed key.
>
> That is why a couple of people in the Pharo community (myself included)
> started an experimental, proof-of-concept, prototype project that aims to
> improve Unicode support. We will announce it to a wider public when we feel
> we have something to show. The goal is in the first place to understand
> and implement the fundamental algorithms, starting with the 4 forms of
> Normalisation. But we're working on collation/sorting too.
>
> This work is of course being done for/in Pharo, using some of the facilities
> only available there.
> It probably won't be difficult to port, but we can't be bothered with
> portability right now.
>
> What we started with is loading the UCD data and making it available as
> nice objects (30,000 of them).
>
> So now you can do things like
>
> $é unicodeCharacterData.
>
> => "U+00E9 LATIN SMALL LETTER E WITH ACUTE (LATIN SMALL LETTER E ACUTE)"
>
> $é unicodeCharacterData uppercase asCharacter.
>
> => "$É"
>
> $é unicodeCharacterData decompositionMapping.
>
> => "#(101 769)"
>
> There is also a cool GT Inspector view.
>
> Next we started implementing a normaliser. It was rather easy to get support
> for simpler languages going. The next code snippets use explicit code
> arrays, because copying decomposed diacritics to my mail client does not
> work (they get automatically composed); in a Pharo Workspace this does work
> nicely with plain strings. The higher numbers are the diacritics.
>
> (normalizer decomposeString: 'les élèves Français') collect: #codePoint as: Array.
>
> => "#(108 101 115 32 101 769 108 101 768 118 101 115 32 70 114 97 110 99 807 97 105 115)"
>
> (normalizer decomposeString: 'Düsseldorf Königsallee') collect: #codePoint as: Array.
>
> => "#(68 117 776 115 115 101 108 100 111 114 102 32 75 111 776 110 105 103 115 97 108 108 101 101)"
>
> normalizer composeString: (#(108 101 115 32 101 769 108 101 768 118 101 115
> 32 70 114 97 110 99 807 97 105 115) collect: #asCharacter as: String).
>
> => "'les élèves Français'"
>
> normalizer composeString: (#(68 117 776 115 115 101 108 100 111 114 102 32
> 75 111 776 110 105 103 115 97 108 108 101 101) collect: #asCharacter as: String).
>
> => "'Düsseldorf Königsallee'"
>
> However, the real algorithm following the official specification (and other
> elements of Unicode that interact with it) is way more complicated (think
> about all those special languages/scripts out there). We're focused on
> understanding/implementing that now.
>
> Next, unit tests were added (of course), as well as a test that uses
> http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt to run about
> 75,000 individual test cases to check conformance to the official Unicode
> Normalization specification.
>
> Right now (with super cool hangul / jamo code by Henrik), we hit the
> following stats:
>
> #testNFC  16998/18593 (91.42%)
> #testNFD  16797/18593 (90.34%)
> #testNFKC 13321/18593 (71.65%)
> #testNFKD 16564/18593 (89.09%)
>
> Way better than the naive implementations, but not yet there.
>
> We are also experimenting and thinking a lot about how to best implement all
> this, trying out different models/ideas/apis/representations.
>
> It will move slowly, but you will hear from us again in the coming
> weeks/months.
>
> Sven
>
> PS: Pharo developers with a good understanding of this subject area that
> want to help, let me know and we'll put you in the loop. Hacking and
> specification reading are required ;-)
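The NormalizationTest.txt data mentioned above can drive such a conformance check fairly directly. A workspace-style sketch, assuming the file contents are already available in a string named fileContents and a normalizer that answers the NFD form via #decomposeString: (as in the snippets quoted above); per the file's own header, each test line has five semicolon-separated fields of space-separated hex code points, and field 3 is the NFD form of field 1:

    | hexField passed total |
    hexField := [ :field |
        (field substrings: ' ')
            collect: [ :hex | (Integer readFrom: hex radix: 16) asCharacter ]
            as: String ].
    passed := 0.
    total := 0.
    fileContents lines do: [ :line |
        (line isEmpty or: [ line first = $# or: [ line first = $@ ] ])
            ifFalse: [ | fields source expectedNFD |
                fields := (line copyUpTo: $#) substrings: ';'.
                source := hexField value: (fields at: 1).
                expectedNFD := hexField value: (fields at: 3).  "field 3 = NFD(field 1)"
                total := total + 1.
                (normalizer decomposeString: source) = expectedNFD
                    ifTrue: [ passed := passed + 1 ] ] ].
    Transcript show: passed printString , ' / ' , total printString , ' NFD cases pass'; cr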
Isn't it strange that the development is not open source?
Levente
In reply to this post by Hannes Hirzel
So a lot of Windows APIs require UTF-16. What's up with UTF-8 being the
only choice mentioned for external communication?

Unicode string encodings like UTF-* and strings of "characters" (that is,
sequences of Unicode code points) should be clearly distinguished. Do you
really mean "UTF-32", or do you mean "UCS-4"? Even those two are not exactly
the same.
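The distinction raised here, between a string as a sequence of code points and its encoded byte forms, can be made concrete in a workspace. A small sketch, assuming a Pharo image with the Zinc character encoders (ZnUTF8Encoder, and ZnUTF16Encoder if present) and their #encodeString: API; the byte counts in the comments are expectations, not verified output:

    | s |
    s := 'café'.  "four code points, é precomposed as U+00E9"
    s collect: [ :each | each codePoint ] as: Array.  "=> #(99 97 102 233), independent of any encoding"
    (ZnUTF8Encoder new encodeString: s) size.   "=> 5: é takes two bytes in UTF-8"
    (ZnUTF16Encoder new encodeString: s) size.  "=> presumably 8: two bytes per BMP code unit, without a BOM"
    "A code point outside the BMP would still be a single Character in the
     image's UTF-32/UCS-4-style strings, but a surrogate pair (four bytes)
     in UTF-16."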