Dale

Thank you for your answer with links to the ICU library and the notes about classes in GemStone. Noteworthy that you have a class Utf8 as a subclass of ByteArray.

I understand that GemStone uses the ICU library and thus does not implement the algorithms in Smalltalk.

I am currently looking into what the ICU library provides.

I found as well a Ruby library [2] which implements CLDR [3].

It has methods like this:

"Alphabetize a list using regular Ruby sort:"

$> ["Art", "Wasa", "Älg", "Ved"].sort
$> ["Art", "Ved", "Wasa", "Älg"]

Alphabetize a list using TwitterCLDR’s locale-aware sort:

$> ["Art", "Wasa", "Älg", "Ved"].localize(:de).sort.to_a
$> ["Älg", "Art", "Ved", "Wasa"]

I hope that given such an example it would not be too difficult to reimplement a similar sort algorithm in Squeak/Cuis/Pharo. Currently the interest is in getting sorting done in a cross-dialect way.

--Hannes

[2] https://blog.twitter.com/2012/twittercldr-improving-internationalization-support-in-ruby
[3] Unicode Common Locale Data Repository http://cldr.unicode.org/index

On 12/7/15, Dale Henrichs <[hidden email]> wrote:
> Hannes,
>
> For GemStone, we are using the ICU library[1]. We have Unicode7,
> Unicode16 and Unicode32 classes (subclasses of CharacterCollection) for
> internal Strings and the class Utf8 (a subclass of ByteArray) for Utf8
> encoded strings ...
>
> The ICU library provides the primitive implementations for working with
> the Unicode* and Utf8 classes
>
> When we started considering Unicode support, we looked at what it would
> take to support collation (our main reason for looking at Unicode in
> the first place), and when we saw just how complicated the collation rules
> can be[2], we were glad to see that someone had already done the hard
> work[1]...
> > Reconciling our legacy String implementations (String, DoubleByteString, > and QuadByteString) with the Unicode* classes was also interesting, > because the rules for Unicode equality and our legacy equality > implementation were not quite compatible. > > If you are interested in more information, I can share additional > details ... > > Dale > > [1] http://site.icu-project.org/ > [2] http://unicode.org/reports/tr10/ > > On 12/07/2015 11:54 AM, H. Hirzel wrote: >> Hello >> >> According to http://www.unicode.org/cldr/charts/27/collation/de.html the >> German >> phonebook sort order is >> >> a A ä Ä ą̈ Ą̈ ǟ Ǟ ạ̈ Ạ̈ ḁ̈ Ḁ̈ b B c C d D e E f F g G h H i I j J k K >> l L m M n N o O ö Ö ǫ̈ Ǫ̈ ȫ Ȫ ơ̈ Ơ̈ ợ̈ Ợ̈ ọ̈ Ọ̈ p P q Q r R s S ss ß t >> T u U ü Ü ǘ Ǘ ǜ Ǜ ǚ Ǚ ų̈ Ų̈ ǖ Ǖ ư̈ Ư̈ ự̈ Ự̈ ụ̈ Ụ̈ ṳ̈ Ṳ̈ ṷ̈ Ṷ̈ ṵ̈ Ṵ̈ v >> V w W x X y Y z Z >> >> I wonder why it looks like this. A lot of characters which never >> appear in a German text. >> >> >> For Spanish there is 'traditional' and 'standard' >> >> http://www.unicode.org/cldr/charts/27/collation/es.html >> >> standard a A á Á b B c C d D e E é É f F g G h H i I í Í j J k K l L m >> M n N ñ Ñ ņ̃ Ņ̃ ṇ̃ Ṇ̃ ṋ̃ Ṋ̃ ṉ̃ Ṉ̃ o O ó Ó p P q Q r R s S t T u U ú Ú >> ü Ü v V w W x X y Y z Z >> >> traditional a A á Á b B c C ch Ch CH cĥ Cĥ CĤ cȟ Cȟ CȞ cḧ Cḧ CḦ cḣ Cḣ >> CḢ cḩ Cḩ CḨ cḥ Cḥ CḤ cḫ Cḫ CḪ cẖ Cẖ d D e E é É f F g G h H i I í Í j >> J k K l L ll Ll LL lĺ Lĺ LĹ lľ Lľ LĽ lļ Lļ LĻ lḷ Lḷ LḶ lḹ Lḹ LḸ lḽ Lḽ >> LḼ lḻ Lḻ LḺ m M n N ñ Ñ ņ̃ Ņ̃ ṇ̃ Ṇ̃ ṋ̃ Ṋ̃ ṉ̃ Ṉ̃ o O ó Ó p P q Q r R s >> S t T u U ú Ú ü Ü v V w W x X y Y z Z >> >> And French is not easily found >> http://www.unicode.org/cldr/charts/27/collation/index.html >> or seems to be defined elsewhere >> >> http://unicode.org/repos/cldr/tags/release-27/common/collation/fr.xml >> >> Suggestions and hints are welcome >> >> --Hannes >> _______________________________________________ >> Cuis mailing list >> [hidden email] >> http://jvuletich.org/mailman/listinfo/cuis_jvuletich.org > > > 
On 12/07/2015 11:31 PM, H. Hirzel wrote:
> I hope that given such an example it would not be too difficult to
> reimplement a similar sort algorithm in Squeak/Cuis/Pharo. Currently
> the interest is in getting sorting done in a cross-dialect-way.

I think that the issue (from a performance perspective) is that you can't depend upon the value of the code point when doing collation: the main algorithm[5] is pretty much table based. In addition to the different sort orders based on characters, there are even more arcane sort rules where characters at the end of a word can affect the sort order of the word (for more info see[4]).

It is worth looking at the Conformance section of the Unicode spec[1], as there are different levels of collation conformance ...

ICU conforms[2] to UTS #10[3], the highest level of conformance ...

It looks like TwitterCLDR[6] uses the Main Algorithm[5] with tailoring[7]. They don't claim to be conformant to the Unicode Collation Algorithm[3], but they are covering a big chunk of the standard use cases ...

Dale

[1] http://unicode.org/reports/tr10/#Conformance
[2] http://userguide.icu-project.org/collation
[3] http://www.unicode.org/reports/tr10/
[4] http://www.unicode.org/reports/tr10/#Introduction
[5] http://www.unicode.org/reports/tr10/#Main_Algorithm
[6] https://blog.twitter.com/2012/twittercldr-improving-internationalization-support-in-ruby
[7] http://unicode.org/reports/tr10/#Tailoring

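To make the point about code point values concrete, here is a minimal Ruby sketch (Ruby only because that is the language of the TwitterCLDR example in this thread). It uses core String methods only; the helper name sort_by_code_point and the constant SAMPLE_WORDS are made up for illustration. It shows why ordering by code point puts "Älg" last: Ä is U+00C4 (196), which is larger than every ASCII letter.

```ruby
# Sort words by their raw code point values (no collation tables involved).
def sort_by_code_point(words)
  words.sort_by { |w| w.codepoints }
end

# "\u00C4" is the precomposed character Ä.
SAMPLE_WORDS = ["Art", "Wasa", "\u00C4lg", "Ved"].freeze

# First code point of each word: A=65, W=87, Ä=196, V=86.
puts SAMPLE_WORDS.map { |w| [w, w.codepoints.first] }.inspect

# Code point order: Art, Ved, Wasa, Älg (the "regular Ruby sort" result).
puts sort_by_code_point(SAMPLE_WORDS).inspect
```

Getting "Älg" next to "Art" needs some mapping from characters to weights, which is where the tables come in.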
Dale - is it that you can't depend on the value of a codepoint
*unless the string is either in fully-composed form
(or has just been fully-decomposed from a fully-composed form)*?

Or are there circumstances where even those two cases cannot be relied upon?

On 8 December 2015 at 19:20, Dale Henrichs <[hidden email]> wrote:
> I think that the issue (from a performance perspective) is that you can't
> depend upon the value of the code point when doing collation: the main
> algorithm[5] is pretty much table based ...
>
> [5] http://www.unicode.org/reports/tr10/#Main_Algorithm

Euan,

What I meant is that you can't _always_ use the code point for collation, i.e., sorting based on the value of code points is not always correct[1].

If I'm not mistaken, the fully-composed and fully-decomposed forms can only be used for testing the equivalence of two strings[2] ...

The Main Algorithm[3] starts by producing a normalized form of the string, but the subsequent steps (produce array, form sort key and compare) involve table lookups, among other things ...

Once you've produced a sort key for each string, the sort keys are collated using "binary comparison", which is a byte-by-byte numeric comparison ...

Dale

[1] http://www.unicode.org/reports/tr10/#Common_Misperceptions
[2] http://unicode.org/reports/tr15/pdtr15.html
[3] http://www.unicode.org/reports/tr10/#Main_Algorithm

On 12/08/2015 12:22 PM, EuanM wrote:
> Dale - is that you can't depend on the value of a codepoint
> *unless the string is either in fully-composed form
> (or has just been fully-decomposed from a fully-composed form) *
>
> OR are there circumstances where even those two cases cannot be relied upon?

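The steps Dale lists (normalize, produce an array of collation elements, form a sort key, compare bytes) can be sketched roughly as follows. This is a toy, not the UCA: the weight table in toy_collation_element is invented for illustration and only knows A-Z, a-z and the combining diaeresis, whereas a real implementation reads its weights from the Unicode/CLDR data files. All method and constant names here are hypothetical.

```ruby
COMBINING_DIAERESIS = "\u0308"

# Hypothetical toy table: one [primary, secondary, tertiary] triple per character.
def toy_collation_element(ch)
  case ch
  when COMBINING_DIAERESIS then [0, 2, 1]                    # ignored at level 1, accent at level 2
  when /\A[a-z]\z/         then [ch.ord - "a".ord + 1, 1, 1]
  when /\A[A-Z]\z/         then [ch.ord - "A".ord + 1, 1, 2] # same letter, case differs at level 3
  else                          [255, 1, 1]                  # anything else sorts last in this toy
  end
end

# Step 1: normalize (NFD). Step 2: look up collation elements.
# Step 3: append the non-zero weights level by level, with 0 as level separator.
def toy_sort_key(str)
  elements = str.unicode_normalize(:nfd).each_char.map { |ch| toy_collation_element(ch) }
  levels = (0..2).map { |level| elements.map { |e| e[level] }.reject(&:zero?) }
  levels.flat_map { |weights| weights + [0] }.pack("C*")
end

# Step 4: sort keys are plain byte strings, so comparing them is a binary comparison.
def toy_collate(words)
  words.sort_by { |w| toy_sort_key(w) }
end

puts toy_collate(["Art", "Wasa", "\u00C4lg", "Ved"]).inspect
```

With these made-up weights the list comes out as Älg, Art, Ved, Wasa, and the precomposed and decomposed spellings of "Älg" produce the same key because of the normalization step.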
Dale,

yes - sorting based on the value of codepoints is almost always guaranteed to be wrong. Sorting is an application-specific issue, not a technical Unicode issue, as there is more than one canonical sort order per culture, and there is often more than one culture per writing system.

e.g. ISO Latin 1 / Latin 9 covers these cultures (amongst others):
English (2 sort orders); Spanish; French (2 sort orders); German (2 sort orders); Swedish; etc.

German sort order differs from Swedish for the same characters, etc.

Todd,

My thinking is that if we implement fully-composed strings as heterogeneous arrays, we sidestep a lot of the complexity of the ICU.

If it turns out that the performance is terrible, we can then seek to incorporate the ICU.

On 8 December 2015 at 22:36, Todd Blanchard <[hidden email]> wrote:
> I just want to second Dale's endorsement of the ICU library. It has been
> around a long time (originally developed by Taligent) and it provides the
> base unicode capabilities for an awful lot of software.
>
> I think it would make more sense to bring icu into Smalltalk as a
> NativeBoost library than to spend resources reimplementing and maintaining
> it.
>
> -Todd Blanchard

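As a small illustration of the German vs Swedish point, here is a Ruby sketch with two toy tailorings. It assumes one common German convention (ä, ö, ü sorted with a, o, u, and ß like ss) and the Swedish alphabet order (å, ä, ö after z). The tables and the names GERMAN_FOLD, SWEDISH_ALPHABET, german_key and swedish_key are made up for illustration and are not taken from CLDR. It also assumes a Ruby version whose String#downcase handles non-ASCII letters (2.4 or later).

```ruby
# Hypothetical toy tailorings, for illustration only.
GERMAN_FOLD = { "\u00E4" => "a", "\u00F6" => "o", "\u00FC" => "u", "\u00DF" => "ss" }.freeze
SWEDISH_ALPHABET = (("a".."z").to_a + ["\u00E5", "\u00E4", "\u00F6"]).freeze

# German toy key: fold umlauts onto their base letters, then compare as plain text.
def german_key(word)
  word.unicode_normalize(:nfc).downcase.each_char.map { |ch| GERMAN_FOLD.fetch(ch, ch) }.join
end

# Swedish toy key: position of each letter in the Swedish alphabet (unknown characters last).
def swedish_key(word)
  word.unicode_normalize(:nfc).downcase.each_char.map do |ch|
    SWEDISH_ALPHABET.index(ch) || SWEDISH_ALPHABET.size
  end
end

WORD_LIST = ["Art", "Wasa", "\u00C4lg", "Ved"].freeze

puts WORD_LIST.sort_by { |w| german_key(w) }.inspect   # Älg sorts next to Art
puts WORD_LIST.sort_by { |w| swedish_key(w) }.inspect  # Älg sorts after Wasa
```

Same four strings, same characters, two different and equally valid orders, which is why the choice has to come from the application or locale rather than from the encoding.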
Equally old are the NeXTSTEP Objective-C functions which are now embodied within Mac OS X. It might also be informative for us to see what's done there.

On 8 December 2015 at 22:36, Todd Blanchard <[hidden email]> wrote:
> I just want to second Dale's endorsement of the ICU library. It has been
> around a long time (originally developed by Taligent) and it provides the
> base unicode capabilities for an awful lot of software.

Hi Todd, it's taken me until now to put my thoughts into words on this issue.

I think we should make it work first. This will allow us to gain more insight into the issues, and create documentation about the process that we, as a community, understand.

If ICU is the right way to go, we can *then* "make it right".

On 8 December 2015 at 22:36, Todd Blanchard <[hidden email]> wrote:
> I think it would make more sense to bring icu into Smalltalk as a
> NativeBoost library than to spend resources reimplementing and maintaining
> it.
>
> -Todd Blanchard

Reading up on http://www.unicode.org/reports/tr15/#Examples

The Unicode standard seems to require that you never make
aGermanStrasse
equivalent in a sort order to the ligatured version,
aGermanStraße

This seems counter-intuitive to me. Is there a reason for this? Have I simply picked this up wrongly?

On 8 December 2015 at 21:35, Dale Henrichs <[hidden email]> wrote:
> If I'm not mistaken the fully-composed and fully-decomposed forms can only
> be used for testing the equivalence of two strings[2] ...
>
> [2] http://unicode.org/reports/tr15/pdtr15.html

On 12/08/2015 04:33 PM, EuanM wrote:
> Reading up http://www.unicode.org/reports/tr15/#Examples
>
> The Unicode standard seems to require you never to make
> aGermanStrasse
> equivalent in a sort order to the ligatured version,
> aGermanStraße

I didn't see anything in the examples related to this ... could you be a bit more specific?

Dale

http://www.unicode.org/reports/tr15/#Stable_Code_Points
Table 7, the discussion of ligatures (which uses the ligature of "ffi" as its example).

Every time I think I'm about to grok this standard, something like this crops up.

On 9 December 2015 at 00:43, Dale Henrichs <[hidden email]> wrote:
> I didn't see anything in the examples related to this ... could you be a bit
> more specific?

I've just discovered the ICU's official listing for an ICU wrapper for Smalltalk:
https://schrievkrom.wordpress.com/tag/icu/

Listed on
http://site.icu-project.org/related

In turn, it mentions:

ICU: Initial source code for Squeak/Pharo.
http://ss3.gemstone.com/ss/ICU.html
by Marten Feldtmann, Jan van de Sandt

ICU-V2 Unicode support
http://ss3.gemstone.com/ss/ICU-V2.html
also by Marten Feldtmann, Jan van de Sandt

"Accessing ICU library from Squeak/Pharo via NativeBoost interface FFI. This is my port to Squeak/Pharo - to have same interface available under VASmalltalk and Pharo/Squeak."

On 9 December 2015 at 01:30, EuanM <[hidden email]> wrote:
> I've just discovered the ICU's official listing for an ICU wrapper for
> Smalltalk:
> https://schrievkrom.wordpress.com/tag/icu/

Excerpts from EuanM's message of 2015-12-09 01:59:43 +0100:
> http://www.unicode.org/reports/tr15/#Stable_Code_Points
> Table 7, the discussion of Ligatures, (which uses the ligature of
> "ffi" as its example)

ß is not a ligature of ss, but is a different character. historically it evolved from a ligature of long s (ſ) and round s, but it is no longer a true ligature that can be decomposed without side effects. they are pronounced differently, and there are german words where using ß vs ss results in a different meaning of the word (eg Buße vs Busse: penance vs busses).

https://en.wikipedia.org/wiki/ß

it is allowed to use ss as a replacement for ß only when ß itself is not available.

this is similar to the german umlauts ä, ö and ü, which can be replaced by ae, oe and ue, but those forms are a mere approximation, not equivalent. in a medium where umlauts are available, using the replacement form can be considered an error.

greetings, martin.

--
eKita - the online platform for your entire academic life
--
chief engineer eKita.co
pike programmer pike.lysator.liu.se caudium.net societyserver.org
secretary beijinglug.org
mentor fossasia.org
foresight developer foresightlinux.org realss.com
unix sysadmin
Martin Bähr working in china http://societyserver.org/mbaehr/

Excerpts from Martin Bähr's message of 2015-12-09 02:43:35 +0100:
> ß is not a ligature of ss, but is a different character.

rereading this, i think i am wrong, in that this has nothing to do with ß vs ss.

looking at the standard i also don't understand your conclusion.

what the standard seems to say is that the ffi ligature is not equivalent to plain ffi because, if i write a string with the ligature, then it is a different string than plain "ffi", because both forms are printable. on the other hand ä and a" (that is, a followed by a combining diaeresis: two different encodings for ä) are the same, because the printed forms are always identical.

however that doesn't mean that they are sorted differently. german sorting rules for example explicitly state that ß and ss are sorted the same (at least according to wikipedia :-) and surely, ffi ligature and ffi are sorted the same too.

greetings, martin.

"The ffi_ligature (U+FB03) is not decomposed, because it has a compatibility mapping, not a canonical mapping."
Standard's text from Table 7

"what the standard seems to say is, that the ffi ligature is not equivalent to plain ffi because, if i write a string with the ligature, then it is a different string than plain "ffi", because both forms are printable. on the other hand ä and a" (two different encodings for ä) are the same, because the printed forms are always identical." --Martin

I agree with that interpretation.

I'm struggling to be clear about the consequences for equality testing.

For sorting - every sort order can be created to go along with every encoding, so I suppose it just depends on which pre-assembled sort order is used.

On 9 December 2015 at 02:36, Martin Bähr <[hidden email]> wrote:
> however that doesn't mean that they are sorted differently.
> german sorting rules for example explicitly state that ß and ss are sorted the same.
> (at least according to wikipedia :-)
> and surely, ffi ligature and ffi are sorted the same too.

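On the equality-testing question, a small Ruby sketch may help separate the two kinds of equivalence discussed above. It relies on String#unicode_normalize from the Ruby standard library (Ruby 2.2 or later); the two helper method names and the constants are made up for this example. Canonical equivalence is tested with NFD, compatibility equivalence with NFKD.

```ruby
# Two strings are canonically equivalent if their NFD forms are identical.
def canonically_equivalent?(a, b)
  a.unicode_normalize(:nfd) == b.unicode_normalize(:nfd)
end

# Two strings are compatibility equivalent if their NFKD forms are identical.
def compatibility_equivalent?(a, b)
  a.unicode_normalize(:nfkd) == b.unicode_normalize(:nfkd)
end

A_UMLAUT_COMPOSED   = "\u00E4"   # ä as a single code point
A_UMLAUT_DECOMPOSED = "a\u0308"  # a followed by combining diaeresis
FFI_LIGATURE        = "\uFB03"   # the ffi ligature from Table 7

puts canonically_equivalent?(A_UMLAUT_COMPOSED, A_UMLAUT_DECOMPOSED)  # true
puts canonically_equivalent?(FFI_LIGATURE, "ffi")                     # false
puts compatibility_equivalent?(FFI_LIGATURE, "ffi")                   # true
puts compatibility_equivalent?("Stra\u00DFe", "Strasse")              # false
```

The last line is the Strasse/Straße case: ß has no decomposition of either kind, so normalization never makes the two spellings equal. Whether they sort together is a separate matter, decided by the collation table and its tailoring.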
On 08-12-15 22:35, Dale Henrichs wrote:
> What I meant is that you can't _always_ use the code point for
> collation, i.e., sorting based on the value of code points is not always
> correct[1].

I gave up on universal sorting when I learned that in Dutch libraries the sorting of author names depends on the country of origin of the author. So if Jan van Beek is Dutch he will be sorted under B, while if he's Belgian he will be sorted under V. I haven't checked what happens if the author emigrates, or changes nationality...

Stephan

Hi Stephan

What you mention is an edge case. A regular case which is not implemented yet is language-insensitive sorting.

http://www.unicode.org/versions/Unicode8.0.0/UnicodeStandard-8.0.pdf

<citation>
In some circumstances, an application may need to do language-insensitive sorting, that is, sorting of textual data without consideration of language-specific cultural expectations about how strings should be ordered.
</citation>

This is currently not supported, as there is no normalization.

As the Unicode Character Database [1] is already available in Squeak / Pharo (and could easily be loaded into Cuis), language-insensitive sorting seems to be within reach without a big effort.

[1] http://wiki.squeak.org/squeak/6244

Hannes

On 12/9/15, Stephan Eggermont <[hidden email]> wrote:
> On 08-12-15 22:35, Dale Henrichs wrote:
>> What I meant is that you can't _always_ use the code point for
>> collation, i.e., sorting based on the value of code points is not always
>> correct[1].
>
> I have given up on universal sorting when I learned that dutch libraries
> sorting of author names depends on the country of origin of the author.
> So if Jan van Beek is dutch he will be sorted under B, while if he's
> belgian under V. I haven't checked what happens if the author emigrates,
> or changes nationality...
>
> Stephan

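A very rough Ruby sketch of why normalization matters even for language-insensitive ordering: it normalizes each string to NFD and then compares code points. This is only an assumption about what a first step could look like. It is not the default Unicode Collation Algorithm order, and the method names are invented for the example. What it does guarantee is that canonically equivalent spellings get the same key.

```ruby
# Key for a well-defined, culture-neutral order: the NFD code point sequence.
def language_insensitive_key(str)
  str.unicode_normalize(:nfd).codepoints
end

def language_insensitive_sort(strings)
  strings.sort_by { |s| language_insensitive_key(s) }
end

ETE_COMPOSED   = "\u00E9t\u00E9"     # été with precomposed é
ETE_DECOMPOSED = "e\u0301te\u0301"   # été with e + combining acute accent

# Without normalization the two spellings are simply different strings.
puts ETE_COMPOSED == ETE_DECOMPOSED

# With normalization they share one key, so they can never sort apart.
puts language_insensitive_key(ETE_COMPOSED) == language_insensitive_key(ETE_DECOMPOSED)

puts language_insensitive_sort(["\u00E9t\u00E9", "zebra", "etc"]).inspect
```

In this sketch "été" lands between "etc" and "zebra", whereas a sort on the raw precomposed code points would put it after "zebra".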
On 12/8/15, Todd Blanchard <[hidden email]> wrote:
> I just want to second Dale's endorsement of the ICU library. It has been
> around a long time (originally developed by Taligent) and it provides the
> base unicode capabilities for an awful lot of software.
>
> I think it would make more sense to bring icu into Smalltalk as a
> NativeBoost library than to spend resources reimplementing and maintaining
> it.
>
> -Todd Blanchard

ICU already seems to be available for Smalltalk, accessed through wrappers:
http://wiki.squeak.org/squeak/6234

--Hannes

It seems that people at Twitter did just this for Ruby: a reimplementation from scratch with an API oriented towards Ruby users. A summary:
http://wiki.squeak.org/squeak/6263

Of course it depends on what you expect to achieve. This thread starts with aiming at getting German, French and Spanish sorting done, plus some other similar cases. The algorithms are table driven and the tables are read into the Smalltalk image as is. As of now the Unicode Character Database is in the Squeak/Pharo image:
http://wiki.squeak.org/squeak/6244

Getting the sorting done does not seem to be extraordinarily hard. However, making use of the Smalltalk ICU wrapper is surely an option.

--Hannes

On 12/9/15, Todd Blanchard <[hidden email]> wrote:
> They are practically the same thing.
>
> ICU was developed by Taligent which was a joint venture between Apple and
> IBM. Makes sense that NSString and ICU's UnicodeString are pretty close in
> implementation. ICU was also ported to Java for Sun by IBM. The point is -
> this is a very elaborate chunk of code with far reach. If ICU is wrong on
> some point - it is universally wrong and thus likely to be taken as "right"
> as it is at least consistent. I think re-implementing it is folly TBH.
> Just use it.
>
>> On Dec 8, 2015, at 15:52, EuanM <[hidden email]> wrote:
>>
>> Equally old are the NextStep Object C functions which are now embodied
>> within MacOS X.

On 12/08/2015 04:59 PM, EuanM wrote:
> http://www.unicode.org/reports/tr15/#Stable_Code_Points
> Table 7, the discussion of Ligatures, (which uses the ligature of
> "ffi" as its example)
>
> Every time I think I'm about to grokk this standard, something like
> this crops up.

Table 7 and Table 8 are showing different normalization forms, and I think Table 8 is an example where the two strings are equivalent (there are two categories of normalization: canonical equivalent [table 7] and compatibility equivalent [table 8]) ... it seems (I haven't delved this far myself) that the normalization form is chosen based on the actual set of characters and possibly some application-specific choices ...

So I don't think there is a fixed one-size-fits-all interpretation ... which makes duplicating the functionality of ICU challenging ...

Dale