Re: [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: [squeak-dev] Unicode Support))


Hannes Hirzel
Dale

Thank you for your answer with the links to the ICU library and the notes
about the classes in GemStone. It is noteworthy that you have a class Utf8
as a subclass of ByteArray.

I understand that GemStone uses the ICU library and thus does not
implement the algorithms in Smalltalk.

I am currently looking into what the ICU library provides.

I also found a Ruby library, TwitterCLDR [2], which implements CLDR [3].

It has methods like this:

Alphabetize a list using regular Ruby sort:

$> ["Art", "Wasa", "Älg", "Ved"].sort
=> ["Art", "Ved", "Wasa", "Älg"]

Alphabetize a list using TwitterCLDR’s locale-aware sort:

$> ["Art", "Wasa", "Älg", "Ved"].localize(:de).sort.to_a
=> ["Älg", "Art", "Ved", "Wasa"]

I hope that, given such an example, it would not be too difficult to
reimplement a similar sort algorithm in Squeak/Cuis/Pharo. Currently
the interest is in getting sorting done in a cross-dialect way.
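
As a rough illustration of what such a reimplementation has to do, here is a minimal Ruby sketch with hand-written rules. It reproduces the German result shown above and shows that Swedish rules order the same words differently. The names BASE_ALPHABET, TAILORINGS, primary_key and locale_sort are made up for this example; they are not TwitterCLDR or CLDR API, and the rules cover only a handful of letters.

```ruby
# Toy locale-aware sort. All names here are hypothetical, for illustration only.
BASE_ALPHABET = ("a".."z").to_a.freeze

TAILORINGS = {
  # German dictionary order: umlauts are compared like their base letters.
  de: { fold: { "ä" => "a", "ö" => "o", "ü" => "u", "ß" => "ss" }, extra: [] },
  # Swedish: å, ä and ö are separate letters placed after z.
  sv: { fold: {}, extra: ["å", "ä", "ö"] }
}.freeze

# Map a word to a list of letter positions in the locale's alphabet.
def primary_key(word, locale)
  rules = TAILORINGS.fetch(locale)
  alphabet = BASE_ALPHABET + rules[:extra]
  folded = word.unicode_normalize(:nfc).downcase.each_char
               .map { |ch| rules[:fold].fetch(ch, ch) }.join
  # Characters missing from the alphabet sort after every known letter.
  folded.each_char.map { |ch| alphabet.index(ch) || alphabet.length }
end

def locale_sort(words, locale)
  # Ties on the primary key fall back to plain string comparison.
  words.sort_by { |word| [primary_key(word, locale), word] }
end

words = ["Art", "Wasa", "Älg", "Ved"]
p locale_sort(words, :de)  # => ["Älg", "Art", "Ved", "Wasa"]
p locale_sort(words, :sv)  # => ["Art", "Ved", "Wasa", "Älg"]
```

The point of the sketch is only that the ordering rules live in per-locale data, not in the code point values. A real implementation would take that data from CLDR.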

--Hannes

[2] https://blog.twitter.com/2012/twittercldr-improving-internationalization-support-in-ruby
[3]  Unicode Common Locale Data Repository http://cldr.unicode.org/index

On 12/7/15, Dale Henrichs <[hidden email]> wrote:

> Hannes,
>
> For GemStone, we are using the ICU library[1]. We have Unicode7,
> Unicode16 and Unicode32 classes (subclasses of CharacterCollection) for
> internal Strings and the class Utf8 (a subclass of ByteArray) for Utf8
> encoded strings ...
>
> The ICU library provides the primitive implementations for working with
> the Unicode* and Utf8 classes
>
> When we started considering Unicode support, we looked at what it would
> take to support collation (our main reason for looking at Unicode in
> the first place). When we saw just how complicated the collation rules
> can be[2], we were glad to see that someone had already done the hard
> work[1].
>
> Reconciling our legacy String implementations (String, DoubleByteString,
> and QuadByteString) with the Unicode* classes was also interesting,
> because the rules for Unicode equality and our legacy equality
> implementation were not quite compatible.
>
> If you are interested in more information, I can share additional
> details ...
>
> Dale
>
> [1] http://site.icu-project.org/
> [2] http://unicode.org/reports/tr10/
>
> On 12/07/2015 11:54 AM, H. Hirzel wrote:
>> Hello
>>
>> According to http://www.unicode.org/cldr/charts/27/collation/de.html the
>> German
>> phonebook sort order is
>>
>> a A ä Ä ą̈ Ą̈ ǟ Ǟ ạ̈ Ạ̈ ḁ̈ Ḁ̈ b B c C d D e E f F g G h H i I j J k K
>> l L m M n N o O ö Ö ǫ̈ Ǫ̈ ȫ Ȫ ơ̈ Ơ̈ ợ̈ Ợ̈ ọ̈ Ọ̈ p P q Q r R s S ss ß t
>> T u U ü Ü ǘ Ǘ ǜ Ǜ ǚ Ǚ ų̈ Ų̈ ǖ Ǖ ư̈ Ư̈ ự̈ Ự̈ ụ̈ Ụ̈ ṳ̈ Ṳ̈ ṷ̈ Ṷ̈ ṵ̈ Ṵ̈ v
>> V w W x X y Y z Z
>>
>> I wonder why it looks like this: it contains a lot of characters that
>> never appear in German text.
>>
>>
>> For Spanish there is 'traditional' and 'standard'
>>
>> http://www.unicode.org/cldr/charts/27/collation/es.html
>>
>> standard a A á Á b B c C d D e E é É f F g G h H i I í Í j J k K l L m
>> M n N ñ Ñ ņ̃ Ņ̃ ṇ̃ Ṇ̃ ṋ̃ Ṋ̃ ṉ̃ Ṉ̃ o O ó Ó p P q Q r R s S t T u U ú Ú
>> ü Ü v V w W x X y Y z Z
>>
>> traditional a A á Á b B c C ch Ch CH cĥ Cĥ CĤ cȟ Cȟ CȞ cḧ Cḧ CḦ cḣ Cḣ
>> CḢ cḩ Cḩ CḨ cḥ Cḥ CḤ cḫ Cḫ CḪ cẖ Cẖ d D e E é É f F g G h H i I í Í j
>> J k K l L ll Ll LL lĺ Lĺ LĹ lľ Lľ LĽ lļ Lļ LĻ lḷ Lḷ LḶ lḹ Lḹ LḸ lḽ Lḽ
>> LḼ lḻ Lḻ LḺ m M n N ñ Ñ ņ̃ Ņ̃ ṇ̃ Ṇ̃ ṋ̃ Ṋ̃ ṉ̃ Ṉ̃ o O ó Ó p P q Q r R s
>> S t T u U ú Ú ü Ü v V w W x X y Y z Z
>>
>> French is not easily found in the chart index
>> http://www.unicode.org/cldr/charts/27/collation/index.html
>> and seems to be defined elsewhere:
>>
>> http://unicode.org/repos/cldr/tags/release-27/common/collation/fr.xml
>>
>> Suggestions and hints are welcome
>>
>> --Hannes
>> _______________________________________________
>> Cuis mailing list
>> [hidden email]
>> http://jvuletich.org/mailman/listinfo/cuis_jvuletich.org

Re: [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: [squeak-dev] Unicode Support))

Dale Henrichs-3


I think that the issue (from a performance perspective) is that you
can't depend upon the value of the code point when doing collation: the
Main Algorithm[5] is pretty much table based. In addition to the
different sort orders based on characters, there are even more arcane
sort rules where characters at the end of a word can affect the sort
order of the word (for more info see [4]).
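
To make "table based" concrete, here is a small Ruby sketch of one such rule, the traditional Spanish treatment of "ch" and "ll" as single letters, as in the CLDR chart quoted earlier in the thread. TRADITIONAL_ES, CONTRACTIONS, collation_elements and traditional_sort are hypothetical names, and the table is a simplified assumption that ignores accents and case levels.

```ruby
# Traditional Spanish order: "ch" sorts after "c", "ll" after "l".
# The table is a simplified, hypothetical stand-in for real CLDR data.
TRADITIONAL_ES = %w[a b c ch d e f g h i j k l ll m n ñ o p q r s t u v w x y z].freeze
CONTRACTIONS = TRADITIONAL_ES.select { |element| element.length > 1 }.freeze

# Split a word into collation elements, matching contractions before single letters.
def collation_elements(word)
  elements = []
  rest = word.downcase
  until rest.empty?
    match = CONTRACTIONS.find { |c| rest.start_with?(c) } || rest[0]
    elements << match
    rest = rest[match.length..-1]
  end
  elements
end

def traditional_key(word)
  # Elements missing from the table sort after every known element.
  collation_elements(word).map { |e| TRADITIONAL_ES.index(e) || TRADITIONAL_ES.length }
end

def traditional_sort(words)
  words.sort_by { |word| traditional_key(word) }
end

words = %w[cuna chico dama llama luz]
p words.sort               # => ["chico", "cuna", "dama", "llama", "luz"]
p traditional_sort(words)  # => ["cuna", "chico", "dama", "luz", "llama"]
```

No comparison of individual code points can produce the second order, because "chico" has to be read as the elements ch, i, c, o before any weights are looked up.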

It is worth looking at the Conformance section of the Unicode spec[1], as
there are different levels of collation conformance.

ICU conforms[2] to UTS #10[3], the highest level of conformance.

It looks like TwitterCLDR[6] uses the Main Algorithm[5] with
tailoring[7]. They don't claim to be conformant to the Unicode Collation
Algorithm[3], but they are covering a big chunk of the standard use
cases.

Dale

[1] http://unicode.org/reports/tr10/#Conformance
[2] http://userguide.icu-project.org/collation
[3] http://www.unicode.org/reports/tr10/
[4] http://www.unicode.org/reports/tr10/#Introduction
[5] http://www.unicode.org/reports/tr10/#Main_Algorithm
[6] https://blog.twitter.com/2012/twittercldr-improving-internationalization-support-in-ruby
[7] http://unicode.org/reports/tr10/#Tailoring

Re: [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: [squeak-dev] Unicode Support))

EuanM
Dale - do you mean that you can't depend on the value of a code point
*unless the string is either in fully-composed form
(or has just been fully-decomposed from a fully-composed form)*?

Or are there circumstances where even those two cases cannot be relied upon?

Re: [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: [squeak-dev] Unicode Support))

Dale Henrichs-3
Euan,

What I meant is that you can't _always_ use the code point for
collation, i.e., sorting based on the value of code points is not always
correct[1].

If I'm not mistaken, the fully-composed and fully-decomposed forms can
only be used for testing the equivalence of two strings[2].
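
A short Ruby sketch of that distinction, using the standard String#unicode_normalize method. The constants and the helper canonically_equal? are names invented for this example.

```ruby
COMPOSED   = "\u00E4"   # ä as a single code point
DECOMPOSED = "a\u0308"  # a followed by U+0308 COMBINING DIAERESIS

# Two strings are canonically equivalent if their NFD forms are identical.
def canonically_equal?(a, b)
  a.unicode_normalize(:nfd) == b.unicode_normalize(:nfd)
end

p COMPOSED.codepoints                       # => [228]
p DECOMPOSED.codepoints                     # => [97, 776]
p COMPOSED == DECOMPOSED                    # => false
p canonically_equal?(COMPOSED, DECOMPOSED)  # => true

# Normalized strings still sort by code point, not alphabetically:
p ["Zebra", "apple", "Älg"].map { |w| w.unicode_normalize(:nfc) }.sort
# => ["Zebra", "apple", "Älg"]
```

Normalization answers "are these the same string?", while the ordering question still needs collation weights.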

The Main Algorithm[3] starts by producing a normalized form of the
string, but the subsequent steps (produce the collation element array,
form the sort key, and compare) involve table lookups, among other things.

Once you've produced a sort key for a string, the sort key does use
"binary comparison" for collating, which is a byte-by-byte numeric
comparison.
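
The four steps can be sketched in a few lines of Ruby. This is a toy under stated assumptions, not the UTS #10 algorithm itself: WEIGHTS is a made-up miniature table with one-byte weights for a-z, A-Z and the combining diaeresis only, and sort_key and collate are hypothetical names.

```ruby
# Hypothetical miniature weight table: character => [primary, secondary, tertiary].
WEIGHTS = {}
("a".."z").each_with_index do |ch, i|
  WEIGHTS[ch]        = [i + 1, 1, 1]  # lowercase letter
  WEIGHTS[ch.upcase] = [i + 1, 1, 2]  # same letter, differs only at the tertiary (case) level
end
WEIGHTS["\u0308"] = [0, 2, 1]         # combining diaeresis: no primary weight
WEIGHTS.freeze

def sort_key(string)
  # Step 1: normalize, so that "Ä" becomes "A" followed by U+0308.
  chars = string.unicode_normalize(:nfd).each_char
  # Step 2: look up collation elements (characters not in the toy table are skipped).
  elements = chars.map { |ch| WEIGHTS[ch] }.compact
  # Step 3: collect the non-zero weights level by level, with 0 as level separator.
  levels = (0..2).map { |level| elements.map { |e| e[level] }.reject(&:zero?) }
  (levels[0] + [0] + levels[1] + [0] + levels[2]).pack("C*")
end

def collate(words)
  # Step 4: compare the sort keys byte by byte.
  words.sort_by { |word| sort_key(word) }
end

p collate(["Wasa", "Art", "Älg", "art", "Ved"])
# => ["Älg", "art", "Art", "Ved", "Wasa"]
```

The accent and the case only matter after all primary weights have compared equal, which is why "Älg" sorts before "art" here even though Ä has a higher code point than every ASCII letter.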

Dale

[1] http://www.unicode.org/reports/tr10/#Common_Misperceptions
[2] http://unicode.org/reports/tr15/pdtr15.html
[3] http://www.unicode.org/reports/tr10/#Main_Algorithm

Re: [Pharo-dev] [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: [squeak-dev] Unicode Support))

EuanM
In reply to this post by Dale Henrichs-3
Dale,

yes - sorting based on the value of codepoints is almost always
guaranteed to be wrong.  Sorting is an application-specific issue, not
a technical Unicode issue, as there is more than one canonical sort
order per culture, and there is often more than one culture per
writing system.

e.g. ISO Latin 1 / Latin 9
covers these cultures (amongst others)
English (2 sort orders); Spanish; French (2 sort orders); German (2
sort orders); Swedish;  etc

German sort order differs from Swedish for the same characters, etc

Todd,

My thinking is that if we implement fully-composed strings as
heterogeneous arrays, we sidestep a lot of the complexity of the ICU.

If it turns out that the performance is terrible, we can then seek to
incorporate the ICU.


On 8 December 2015 at 22:36, Todd Blanchard <[hidden email]> wrote:

> I just want to second Dale's endorsement of the ICU library.  It has been
> around a long time (originally developed by Taligent) and it provides the
> base unicode capabilities for an awful lot of software.
>
> I think it would make more sense to bring icu into Smalltalk as a
> NativeBoost library than to spend resources reimplementing and maintaining
> it.
>
> -Todd Blanchard

Re: [Pharo-dev] [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: [squeak-dev] Unicode Support))

EuanM
Equally old are the NeXTSTEP Objective-C functions, which are now embodied
within Mac OS X.

It might also be informative for us to see what's done there.

Re: [Pharo-dev] [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: [squeak-dev] Unicode Support))

EuanM
In reply to this post by Dale Henrichs-3
Hi Todd, it has taken me until now to put my thoughts on this issue into words.

I think we should make it work first.  This will allow us to gain more
insight into the issues, and create documentation about the process
that we, as a community, understand.

If ICU is the right way to go, we can *then* "make it right".




Re: [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: [squeak-dev] Unicode Support))

EuanM
In reply to this post by Dale Henrichs-3
Reading up on http://www.unicode.org/reports/tr15/#Examples

The Unicode standard seems to require you never to make
aGermanStrasse
equivalent in a sort order to the ligatured version,
aGermanStraße

This seems counter-intuitive to me.

Is there a reason for this? Or have I simply misunderstood it?

Re: [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: [squeak-dev] Unicode Support))

Dale Henrichs-3


On 12/08/2015 04:33 PM, EuanM wrote:
> Reading up http://www.unicode.org/reports/tr15/#Examples
>
> The Unicode standard seems to require you never to make
> aGermanStrasse
> equivalent in a sort order to the ligatured version,
> aGermanStraße
I didn't see anything in the examples related to this ... could you be a
bit more specific?

Dale

Re: [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: [squeak-dev] Unicode Support))

EuanM
http://www.unicode.org/reports/tr15/#Stable_Code_Points
Table 7, the discussion of ligatures (which uses the "ffi" ligature
as its example).

Every time I think I'm about to grok this standard, something like
this crops up.



Re: [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: [squeak-dev] Unicode Support))

EuanM
I've just discovered the ICU's official listing for an ICU wrapper for
Smalltalk:
https://schrievkrom.wordpress.com/tag/icu/

Listed on
http://site.icu-project.org/related

In turn, it mentions:

ICU: Initial source code for Squeak/Pharo.
http://ss3.gemstone.com/ss/ICU.html
by Marten Feldtmann, Jan van de Sandt

ICU-V2 Unicode support
http://ss3.gemstone.com/ss/ICU-V2.html
also by Marten Feldtmann, Jan van de Sandt

Re: [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: [squeak-dev] Unicode Support))

EuanM
"Accessing ICU library from Squeak/Pharo via NativeBoost interface
FFI. This is my port to Squeak/Pharo - to have same interface
available under VASmalltalk and Pharo/Squeak."

Re: [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: [squeak-dev] Unicode Support))

Martin Bähr
In reply to this post by EuanM
Excerpts from EuanM's message of 2015-12-09 01:59:43 +0100:
> http://www.unicode.org/reports/tr15/#Stable_Code_Points
> Table 7, the discussion of Ligatures, (which uses the ligature of
> "ffi" as its example)

ß is not a ligature of ss, but a different character.
historically it evolved from a ligature of long s (ſ) and round s,
but it is no longer a true ligature that can be decomposed without side effects.

they are pronounced differently, and there are german words where the difference
of using ß vs ss results in a different meaning of the word (eg Buße vs Busse:
penance vs buses).

https://en.wikipedia.org/wiki/ß

it is allowed to use ss as a replacement for ß only when ß itself is not available.

this is similar to the german umlauts ä, ö and ü, which can be written as
ae, oe and ue, but those forms are a mere approximation, not equivalent. in a
medium where umlauts are available, using such a replacement can be considered
an error.

greetings, martin.

--
eKita                   -   the online platform for your entire academic life
--
chief engineer                                                       eKita.co
pike programmer      pike.lysator.liu.se    caudium.net     societyserver.org
secretary                                                      beijinglug.org
mentor                                                           fossasia.org
foresight developer  foresightlinux.org                            realss.com
unix sysadmin
Martin Bähr          working in china        http://societyserver.org/mbaehr/

Re: [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: [squeak-dev] Unicode Support))

Martin Bähr
Excerpts from Martin Bähr's message of 2015-12-09 02:43:35 +0100:
> Excerpts from EuanM's message of 2015-12-09 01:59:43 +0100:
> > http://www.unicode.org/reports/tr15/#Stable_Code_Points
> > Table 7, the discussion of Ligatures, (which uses the ligature of
> > "ffi" as its example)
>
> ß is not a ligature of ss, but is a different character.

rereading this, i think i am wrong, in that this has nothing to do with ß vs ss.

looking at the standard i also don't understand your conclusion.

what the standard seems to say is that the ffi ligature is not equivalent to
plain ffi because, if i write a string with the ligature, then it is a
different string than plain "ffi",  because both forms are printable.
on the other hand ä and a" (two different encodings for ä) are the same,
because the printed forms are always identical.

however that doesn't mean that they are sorted differently.
german sorting rules for example explicitly state that ß and ss are sorted the same.
(at least according to wikipedia :-)
and surely, ffi ligature and ffi are sorted the same too.
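
a small ruby sketch of that idea: a character can stay a distinct character for equality and still be sorted like a sequence of other letters. EXPANSIONS, primary_string, primary_equal? and expansion_sort are invented names, and the two-entry table is only an assumption for illustration, not real collation data.

```ruby
# Hypothetical expansion table: one character is weighted like several letters.
EXPANSIONS = { "ß" => "ss", "\uFB03" => "ffi" }.freeze

# The string as seen by a primary-level comparison.
def primary_string(word)
  word.downcase.each_char.map { |ch| EXPANSIONS.fetch(ch, ch) }.join
end

def primary_equal?(a, b)
  primary_string(a) == primary_string(b)
end

def expansion_sort(words)
  # Primary comparison first; the original spelling only breaks ties.
  words.sort_by { |word| [primary_string(word), word] }
end

p "Straße" == "Strasse"                # => false
p primary_equal?("Straße", "Strasse")  # => true
p expansion_sort(["Strasse", "Strand", "Straße", "Strauß"])
# => ["Strand", "Strasse", "Straße", "Strauß"]
```

so "different string" and "sorted in the same place" do not contradict each other: equality looks at the characters, sorting looks at the weights assigned to them.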

greetings, martin.

Re: [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: [squeak-dev] Unicode Support))

EuanM
"The ffi_ligature (U+FB03) is not decomposed, because it has a
compatibility mapping, not a canonical mapping."
Standard's text from Table 7

"what the standard seems to say is, that the ffi ligature is not equivalent to
plain ffi because, if i write a string with the ligature, then it is a
different string than plain "ffi",  because both forms are printable.
on the other hand ä and a" (two different encodings for ä) are the same,
because the printed forms are always identical."  --Martin

I agree with that interpretation.
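
The difference between the two kinds of mapping can be checked with Ruby's standard String#unicode_normalize. In this sketch, FFI_LIGATURE, normal_forms and compatibility_equal? are names chosen for the example.

```ruby
FFI_LIGATURE = "\uFB03"  # LATIN SMALL LIGATURE FFI

# Return all four normalization forms of a string, keyed by form name.
def normal_forms(string)
  %i[nfc nfd nfkc nfkd].map { |form| [form, string.unicode_normalize(form)] }.to_h
end

# Equality after applying compatibility mappings as well as canonical ones.
def compatibility_equal?(a, b)
  a.unicode_normalize(:nfkd) == b.unicode_normalize(:nfkd)
end

# The canonical forms leave the ligature alone; the compatibility forms expand it.
p normal_forms(FFI_LIGATURE)[:nfc] == FFI_LIGATURE  # => true
p normal_forms(FFI_LIGATURE)[:nfkc]                 # => "ffi"
p compatibility_equal?(FFI_LIGATURE, "ffi")         # => true

# ß has no decomposition of either kind, so no form turns it into "ss".
p normal_forms("ß").values.uniq                     # => ["ß"]
p compatibility_equal?("Straße", "Strasse")         # => false
```

So for equality testing the answer depends on which forms are compared: under NFC/NFD the ligature and "ffi" differ, under NFKC/NFKD they match, and ß stays distinct from "ss" either way.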

I'm struggling to be clear about the consequences for equality testing.

For sorting - every sort order can be created to go along with every
encoding, so I suppose it just depends on which pre-assembled sort
order is used.



On 9 December 2015 at 02:36, Martin Bähr
<[hidden email]> wrote:

> Excerpts from Martin Bähr's message of 2015-12-09 02:43:35 +0100:
>> Excerpts from EuanM's message of 2015-12-09 01:59:43 +0100:
>> > http://www.unicode.org/reports/tr15/#Stable_Code_Points
>> > Table 7, the discussion of Ligatures, (which uses the ligature of
>> > "ffi" as its example)
>>
>> ß is not a ligature of ss, but is a different character.
>
> rereading this, i think i am wrong, in that this has nothing to do with ß vs ss.
>
> looking at the standard i also don't understand your conclusion.
>
> what the standard seems to say is, that the ffi ligature is not equivalent to
> plain ffi because, if i write a string with the ligature, then it is a
> different string than plain "ffi",  because both forms are printable.
> on the other hand ä and a" (two different encodings for ä) are the same,
> because the printed forms are always identical.
>
> however that doesn't mean that they are sorted differently.
> german sorting rules for example explicitly state that ß and ss are sorted the same.
> (at least according to wikipedia :-)
> and surely, ffi ligature and ffi are sorted the same too.
>
> greetings, martin.


Re: [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: Unicode Support))

Stephan Eggermont-3
In reply to this post by Dale Henrichs-3
On 08-12-15 22:35, Dale Henrichs wrote:
> What I meant is that you can't _always_ use the code point for
> collation, i.e., sorting based on the value of code points is not always
> correct[1].

I gave up on universal sorting when I learned that Dutch libraries sort
author names according to the author's country of origin. So if Jan van
Beek is Dutch he is filed under B, but if he is Belgian, under V. I
haven't checked what happens if the author emigrates, or changes
nationality...
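
A toy sketch of such a rule in Python. The Author record, the country
codes and the filing_key function are all made up for illustration and
are not taken from any library's actual cataloguing rules. The point is
only that the sort key needs data beyond the name string itself.

```python
from typing import NamedTuple

class Author(NamedTuple):
    given: str
    prefix: str    # e.g. "van"; empty string if there is none
    family: str
    country: str   # "NL" or "BE" (hypothetical codes for this sketch)

def filing_key(a):
    if a.country == "NL":
        # Dutch filing as described above: the prefix is not the main entry.
        return (a.family.casefold(), a.prefix.casefold(), a.given.casefold())
    # Belgian filing as described above: the prefix is part of the entry.
    full = f"{a.prefix} {a.family}".strip()
    return (full.casefold(), "", a.given.casefold())

authors = [
    Author("Jan", "van", "Beek", "BE"),
    Author("Jan", "van", "Beek", "NL"),
    Author("Piet", "", "Maas", "NL"),
]

for a in sorted(authors, key=filing_key):
    print(a.country, a.prefix, a.family)
```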

Stephan




Re: [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: Unicode Support))

Hannes Hirzel
Hi Stephan

What you mention is an edge case. A regular case which is not
implemented yet is language-insensitive sorting, as described in the
Unicode Standard:

http://www.unicode.org/versions/Unicode8.0.0/UnicodeStandard-8.0.pdf
<citation>
In some circumstances, an application may need to do
language-insensitive sorting—that
is, sorting of textual data without consideration of language-specific
cultural expectations about how strings should be ordered.
</citation>

This is currently not possible because there is no normalization. As the
Unicode Character Database [1] is already available in Squeak / Pharo
(and could easily be loaded into Cuis), implementing
language-insensitive sorting seems to be within reach without a big
effort.
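
To illustrate why normalization matters here, a small Python sketch
(standard library only; the name nfd_key is invented for this example).
It shows only the normalization step: strings are brought into NFD and
then compared by code point. That is a simplification. The default
ordering defined by the Unicode Collation Algorithm uses a weight table
(DUCET), not raw code points.

```python
import unicodedata

def nfd_key(s):
    # Compare strings by code point after canonical decomposition, so the
    # result does not depend on how the accented letters were encoded.
    return unicodedata.normalize("NFD", s)

words = [
    "\u00e4a",    # precomposed ä, then a
    "a\u0308b",   # a + COMBINING DIAERESIS, then b
    "az",
]

# Without normalization the precomposed form sorts after everything that
# starts with a plain "a", even though both words begin with ä.
print(sorted(words))
print(sorted(words, key=nfd_key))
```

With the key, the two words starting with ä are ordered by their second
letter regardless of which encoding of ä was used.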

[1] http://wiki.squeak.org/squeak/6244

Hannes

On 12/9/15, Stephan Eggermont <[hidden email]> wrote:

> On 08-12-15 22:35, Dale Henrichs wrote:
>> What I meant is that you can't _always_ use the code point for
>> collation, i.e., sorting based on the value of code points is not always
>> correct[1].
>
> I gave up on universal sorting when I learned that Dutch libraries sort
> author names according to the author's country of origin. So if Jan van
> Beek is Dutch he is filed under B, but if he is Belgian, under V. I
> haven't checked what happens if the author emigrates, or changes
> nationality...
>
> Stephan
>
>
>
>


Re: [Pharo-dev] [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: [squeak-dev] Unicode Support))

Hannes Hirzel
In reply to this post by Dale Henrichs-3
On 12/8/15, Todd Blanchard <[hidden email]> wrote:
> I just want to second Dale's endorsement of the ICU library.  It has been
> around a long time (originally developed by Taligent) and it provides the
> base unicode capabilities for an awful lot of software.
>
> I think it would make more sense to bring icu into Smalltalk as a
> NativeBoost library than to spend resources reimplementing and maintaining
> it.
>
> -Todd Blanchard

ICU already seems to be available for Smalltalk through wrappers:

http://wiki.squeak.org/squeak/6234

--Hannes

>> On Dec 8, 2015, at 11:20, Dale Henrichs <[hidden email]>
>> wrote:
>>
>> On 12/07/2015 11:31 PM, H. Hirzel wrote:
>>> Dale
>>>
>>> Thank you for your answer with links to the ICU library and the notes
>>> about classes in Gemstone. Noteworthy that you have a class Utf8 as a
>>> subclass of ByteArray.
>>>
>>> I understand that Gemstone uses the ICU library and thus does not
>>> implement the algorithms in Smalltalk.
>>>
>>> I am currently looking into what the  ICU  library provides.
>
>


Re: [Pharo-dev] [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: [squeak-dev] Unicode Support))

Hannes Hirzel
In reply to this post by EuanM
It seems that people at Twitter just did this for Ruby --
reimplementation from scratch with an API oriented towards Ruby users.

A summary
http://wiki.squeak.org/squeak/6263

Of course it depends on what you expect to achieve.
This thread starts with aiming at getting German, French and Spanish
sorting done. Plus some other similar cases.

The algorithms are table driven and the tables are read into the
Smalltalk image as is.
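
As a rough illustration of what "table driven" means, here is a Python
sketch with a tiny two-level weight table. The table contents and the
names WEIGHTS and collation_key are invented for this example. The real
tables (DUCET, CLDR tailorings) are far larger and also handle further
levels, contractions and expansions. The word list is the one from the
TwitterCLDR example earlier in this thread.

```python
# Hypothetical weight table: character -> (primary, secondary).
# "a" and "ä" share a primary weight and differ only at the second level.
WEIGHTS = {
    "a": (1, 0), "\u00e4": (1, 1),
    "d": (2, 0), "e": (3, 0), "g": (4, 0), "l": (5, 0),
    "r": (6, 0), "s": (7, 0), "t": (8, 0), "v": (9, 0), "w": (10, 0),
}

def collation_key(word):
    pairs = [WEIGHTS[ch] for ch in word.lower()]
    primary = tuple(p for p, _ in pairs)
    secondary = tuple(s for _, s in pairs)
    # All primary weights are compared before any secondary weight.
    return (primary, secondary)

words = ["Art", "Wasa", "\u00c4lg", "Ved"]

print(sorted(words))                      # plain code point order
print(sorted(words, key=collation_key))   # table-driven order
```

With this table Älg sorts before Art, because ä and a are equal at the
primary level and l comes before r.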

As of now the Unicode Character Database is in the Squeak/Pharo image
http://wiki.squeak.org/squeak/6244

Getting the sorting done does not seem to be extraordinarily hard.
However, making use of the Smalltalk ICU wrapper is surely an option.

--Hannes

On 12/9/15, Todd Blanchard <[hidden email]> wrote:

> They are practically the same thing.
>
> ICU was developed by Taligent which was a joint venture between Apple and
> IBM.  Makes sense that NSString and ICU's UnicodeString are pretty close in
> implementation.  ICU was also ported to Java for Sun by IBM.  The point is -
> this is a very elaborate chunk of code with far reach. If ICU is wrong on
> some point - it is universally wrong and thus likely to be taken as "right"
> as it is at least consistent.  I think re-implementing it is folly TBH.
> Just use it.
>
>> On Dec 8, 2015, at 15:52, EuanM <[hidden email]> wrote:
>>
>> Equally old are the NeXTSTEP Objective-C functions which are now
>> embodied within Mac OS X.
>>
>
>
>


Re: [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: [squeak-dev] Unicode Support))

Dale Henrichs-3
In reply to this post by EuanM


On 12/08/2015 04:59 PM, EuanM wrote:
> http://www.unicode.org/reports/tr15/#Stable_Code_Points
> Table 7, the discussion of Ligatures, (which uses the ligature of
> "ffi" as its example)
>
> Every time I think I'm about to grok this standard, something like
> this crops up.
>
Table 7 and Table 8 show different normalization forms, and I think
Table 8 is an example where the two strings are equivalent (there are
two categories of normalization: canonical equivalence [table 7] and
compatibility equivalence [table 8]). It seems (I haven't delved this
far myself) that the normalization form is chosen based on the actual
set of characters and possibly some application-specific choices.

So I don't think there is a fixed one-size-fits-all interpretation,
which makes duplicating the functionality of ICU challenging.

Dale
