Sorting Unicode strings (Re: [Unicode] collation sequences (Re: [squeak-dev] Unicode Support))

Hannes Hirzel
Hello

According to http://www.unicode.org/cldr/charts/27/collation/de.html the German
phonebook sort order is:

a A ä Ä ą̈ Ą̈ ǟ Ǟ ạ̈ Ạ̈ ḁ̈ Ḁ̈ b B c C d D e E f F g G h H i I j J k K
l L m M n N o O ö Ö ǫ̈ Ǫ̈ ȫ Ȫ ơ̈ Ơ̈ ợ̈ Ợ̈ ọ̈ Ọ̈ p P q Q r R s S ss ß t
T u U ü Ü ǘ Ǘ ǜ Ǜ ǚ Ǚ ų̈ Ų̈ ǖ Ǖ ư̈ Ư̈ ự̈ Ự̈ ụ̈ Ụ̈ ṳ̈ Ṳ̈ ṷ̈ Ṷ̈ ṵ̈ Ṵ̈ v
V w W x X y Y z Z

I wonder why it looks like this. It contains a lot of characters that
never appear in German text.
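
As a rough illustration only, and not what CLDR or ICU actually compute:
the phonebook variant is commonly described as filing ä, ö and ü as if
they were spelled ae, oe and ue (the DIN 5007-2 convention for lists of
names). The sketch below uses only the Ruby standard library. The names
phonebook_key and phonebook_sort are hypothetical, and a real collator
works with multi-level weights rather than string substitution.

```ruby
# Toy approximation of German phonebook ordering.
# Assumption: umlauts expand to base vowel + "e" and ß expands to "ss".
PHONEBOOK_EXPANSIONS = {
  "ä" => "ae", "ö" => "oe", "ü" => "ue",
  "Ä" => "Ae", "Ö" => "Oe", "Ü" => "Ue",
  "ß" => "ss"
}.freeze

# Build a sort key: the expanded, case-folded spelling comes first,
# the original spelling is only used to break ties.
def phonebook_key(word)
  expanded = word.gsub(/[äöüÄÖÜß]/, PHONEBOOK_EXPANSIONS)
  [expanded.downcase, word]
end

def phonebook_sort(words)
  words.sort_by { |w| phonebook_key(w) }
end

puts phonebook_sort(["Müller", "Mueller", "Muller", "Mahler"]).inspect
# prints ["Mahler", "Mueller", "Müller", "Muller"]
```

With this key "Müller" lands next to "Mueller" and before "Muller",
which is the behaviour one expects from a phonebook.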


For Spanish there are two orderings, 'traditional' and 'standard':

http://www.unicode.org/cldr/charts/27/collation/es.html

standard a A á Á b B c C d D e E é É f F g G h H i I í Í j J k K l L m
M n N ñ Ñ ņ̃ Ņ̃ ṇ̃ Ṇ̃ ṋ̃ Ṋ̃ ṉ̃ Ṉ̃ o O ó Ó p P q Q r R s S t T u U ú Ú
ü Ü v V w W x X y Y z Z

traditional a A á Á b B c C ch Ch CH cĥ Cĥ CĤ cȟ Cȟ CȞ cḧ Cḧ CḦ cḣ Cḣ
CḢ cḩ Cḩ CḨ cḥ Cḥ CḤ cḫ Cḫ CḪ cẖ Cẖ d D e E é É f F g G h H i I í Í j
J k K l L ll Ll LL lĺ Lĺ LĹ lľ Lľ LĽ lļ Lļ LĻ lḷ Lḷ LḶ lḹ Lḹ LḸ lḽ Lḽ
LḼ lḻ Lḻ LḺ m M n N ñ Ñ ņ̃ Ņ̃ ṇ̃ Ṇ̃ ṋ̃ Ṋ̃ ṉ̃ Ṉ̃ o O ó Ó p P q Q r R s
S t T u U ú Ú ü Ü v V w W x X y Y z Z
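
The visible difference between the two lists is that 'traditional'
treats "ch" and "ll" as letters of their own, placed after "c" and "l".
A minimal sketch of that idea in plain Ruby follows. The alphabet table
and the names traditional_key and traditional_sort are made up for this
illustration, and accents are simply folded away, which a real collator
would not do.

```ruby
# Hypothetical alphabet for traditional Spanish ordering: "ch" and "ll"
# count as single letters after "c" and "l"; "ñ" follows "n".
TRADITIONAL_ALPHABET =
  %w[a b c ch d e f g h i j k l ll m n ñ o p q r s t u v w x y z].freeze
RANK = TRADITIONAL_ALPHABET.each_with_index.to_h.freeze
ACCENTS = { "á" => "a", "é" => "e", "í" => "i",
            "ó" => "o", "ú" => "u", "ü" => "u" }.freeze

# Split a word into letters, matching the two-letter units first,
# and map each unit to its rank in the alphabet above.
def traditional_key(word)
  folded = word.downcase.gsub(/[áéíóúü]/, ACCENTS)
  folded.scan(/ch|ll|./m).map { |unit| RANK.fetch(unit, RANK.size) }
end

def traditional_sort(words)
  words.sort_by { |w| traditional_key(w) }
end

puts traditional_sort(%w[cosa chico dama llama luz lobo]).inspect
# prints ["cosa", "chico", "dama", "lobo", "luz", "llama"]
```

A plain sort of the same words would put "chico" before "cosa" and
"llama" before "lobo".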

French is not easily found in the chart index
http://www.unicode.org/cldr/charts/27/collation/index.html
and seems to be defined elsewhere:

http://unicode.org/repos/cldr/tags/release-27/common/collation/fr.xml

Suggestions and hints are welcome.

--Hannes
_______________________________________________
Cuis mailing list
[hidden email]
http://jvuletich.org/mailman/listinfo/cuis_jvuletich.org
Re: Sorting Unicode strings (Re: [Unicode] collation sequences (Re: [squeak-dev] Unicode Support))

Dale Henrichs-3
Hannes,

For GemStone, we are using the ICU library[1]. We have Unicode7,
Unicode16 and Unicode32 classes (subclasses of CharacterCollection) for
internal Strings and the class Utf8 (a subclass of ByteArray) for Utf8
encoded strings ...

The ICU library provides the primitive implementations for working with
the Unicode* and Utf8 classes.

When we started considering Unicode support, we looked at what it would
take to support collation (our main reason for looking at Unicode in
the first place). Once we saw just how complicated the collation rules
can be[2], we were glad to see that someone had already done the hard
work[1]...

Reconciling our legacy String implementations (String, DoubleByteString,
and QuadByteString) with the Unicode* classes was also interesting,
because the rules for Unicode equality and our legacy equality
implementation were not quite compatible.
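
The message does not say where the incompatibility lay. One well-known
way in which a code-point-by-code-point comparison can disagree with
Unicode's notion of equality is canonical equivalence, shown here purely
as an assumed example using Ruby's standard String#unicode_normalize.
The helper name canonically_equal? is hypothetical.

```ruby
# Two spellings of "é" that render identically.
precomposed = "\u00E9"    # LATIN SMALL LETTER E WITH ACUTE
decomposed  = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# Compare after normalizing both sides to NFC, so canonically
# equivalent sequences are treated as equal.
def canonically_equal?(a, b)
  a.unicode_normalize(:nfc) == b.unicode_normalize(:nfc)
end

puts precomposed == decomposed                   # prints false
puts canonically_equal?(precomposed, decomposed) # prints true
```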

If you are interested in more information, I can share additional
details ...

Dale

[1] http://site.icu-project.org/
[2] http://unicode.org/reports/tr10/

On 12/07/2015 11:54 AM, H. Hirzel wrote:

> [...]

Re: Sorting Unicode strings (Re: [Unicode] collation sequences (Re: [squeak-dev] Unicode Support))

Hannes Hirzel
Dale

Thank you for your answer with links to the ICU library and the notes
about classes in GemStone. It is noteworthy that you have a class Utf8
as a subclass of ByteArray.

I understand that GemStone uses the ICU library and thus does not
implement the algorithms in Smalltalk.

I am currently looking into what the ICU library provides.

I also found a Ruby library, TwitterCLDR [2], which implements CLDR [3].

It has methods like these.

Alphabetize a list using regular Ruby sort:

$> ["Art", "Wasa", "Älg", "Ved"].sort
=> ["Art", "Ved", "Wasa", "Älg"]

Alphabetize a list using TwitterCLDR’s locale-aware sort:

$> ["Art", "Wasa", "Älg", "Ved"].localize(:de).sort.to_a
=> ["Älg", "Art", "Ved", "Wasa"]

I hope that, given such an example, it would not be too difficult to
reimplement a similar sort algorithm in Squeak/Cuis/Pharo. Currently
the interest is in getting sorting done in a cross-dialect way.
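
To make the idea concrete, here is a much simplified sketch that
reproduces the ordering of the four words above without any library
beyond Ruby's standard String#unicode_normalize. It is not how
TwitterCLDR or ICU work internally: it only strips combining marks to
get a base-letter key and falls back to the original string for ties.
The names primary_key and simple_collation_sort are hypothetical.

```ruby
# Primary key: decompose to NFD, drop combining marks (category Mn),
# then fold case. "Älg" becomes "alg".
def primary_key(word)
  word.unicode_normalize(:nfd).gsub(/\p{Mn}/, "").downcase
end

# Sort by base letters first; the original spelling breaks ties.
def simple_collation_sort(words)
  words.sort_by { |w| [primary_key(w), w] }
end

puts simple_collation_sort(["Art", "Wasa", "Älg", "Ved"]).inspect
# prints ["Älg", "Art", "Ved", "Wasa"]
```

The same two-step shape (derive a key per string, then compare keys)
should carry over to Smalltalk, with the key derivation being the part
that needs Unicode data.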

--Hannes

[2] https://blog.twitter.com/2012/twittercldr-improving-internationalization-support-in-ruby
[3] Unicode Common Locale Data Repository http://cldr.unicode.org/index

On 12/7/15, Dale Henrichs <[hidden email]> wrote:

> [...]